Nevis particle-physics administrative cluster configuration

This is a reference page. It contains the text file that describes how the high-availability Pacemaker configuration was set up on the two administrative servers, hypatia and orestes.

Files

Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files!

/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
/etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)

The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://pastebin.com/u/wgseligman and searching for 20130103.

One-time set-up

The commands to set up a dual-primary cluster are outlined here. The details can be found in Clusters From Scratch and Redhat Cluster Tutorial.

Warning: Do not type any of these commands in the hopes of fixing a problem! They will erase the shared DRBD drive.

DRBD set-up

Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia:

/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Then on orestes:

/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Back to hypatia:

/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd

Keep looking at the contents of /proc/drbd. It will take a while, but eventually the two disks will sync up.
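
If you'd rather not re-run cat by hand, watch can poll the file for you; a minimal sketch (the 10-second interval is arbitrary):

# Refresh the DRBD status every 10 seconds until the sync finishes (Ctrl-C to exit).
watch -n 10 cat /proc/drbd

While the sync is in progress the ds: field will read UpToDate/Inconsistent along with a progress bar; once both sides report UpToDate/UpToDate, the sync is done.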

Back to orestes:

/sbin/drbdadm primary admin
cat /proc/drbd

The result should be something like this:

# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Here's a guide to understanding the contents of /proc/drbd.
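
As a quick reference (hedged from memory; consult that guide for the authoritative definitions), the most useful fields in the status line are:

# cs  = connection state (Connected when both nodes see each other)
# ro  = roles of the local/peer nodes (Primary/Primary in our dual-primary setup)
# ds  = disk states of the local/peer nodes (UpToDate/UpToDate once the sync is complete)
# oos = amount of data currently out of sync (0 when the disks are identical)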

Clustered LVM setup

Most of the following commands only have to be issued on one of the nodes. See Clusters From Scratch and Redhat Cluster Tutorial for details.

  • Edit /etc/lvm/lvm.conf on both systems; search this file for the initials WGS for a complete list of changes.
    • Change the filter line to search for DRBD partitions:
      filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]
    • For lvm locking:
      locking_type = 3

  • Edit /etc/sysconfig/cman so that cman won't wait for quorum at startup (there are only two nodes in the cluster):
    sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman

  • Create a physical volume and a clustered volume group on the DRBD partition:
    pvcreate /dev/drbd0
    vgcreate -c y ADMIN /dev/drbd0

  • For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example:
    lvcreate -L 200G -n usr ADMIN # ... and so on
    mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
    Note that Nevis_HA is the cluster name defined in /etc/cluster/cluster.conf.

  • Make sure that the cman, clvmd, and pacemaker daemons will start at boot; on both nodes, do:
    /sbin/chkconfig cman on
    /sbin/chkconfig clvmd on
    /sbin/chkconfig pacemaker on

  • Reboot both nodes.
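
After the reboot, a quick sanity check on each node confirms that the cluster stack, DRBD, and the clustered volume group all came back (a minimal sketch; the exact output will vary):

# Both nodes should be listed as cluster members.
cman_tool nodes
# DRBD should report Primary/Primary and UpToDate/UpToDate.
cat /proc/drbd
# The ADMIN volume group should show the clustered attribute ("c" in the Attr column).
vgs ADMIN
# Pacemaker should show both nodes online.
crm_mon -1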

Pacemaker configuration

Commands

The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

crm configure show

To see the status of all the resources:

crm resource status

To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):

crm_mon

You can run the above commands via sudo, but you'll have to extend your path; e.g.,

export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon

Concepts

This may help as you work your way through the configuration:

crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
   cidr_netmask=32 op monitor interval=30s

# Which is composed of
    * crm ::= "cluster resource manager", the command we're executing
    * primitive ::= The type of resource object that we’re creating.
    * MyIPResource ::= Our name for the resource
    * IPaddr2 ::= The script to call
    * ocf ::= The standard it conforms to
    * ip=192.168.85.3 ::= Parameter(s) as name/value pairs
    * cidr_netmask ::= netmask; 32-bits means use this exact IP address
    * op ::= what follows are operation definitions
    * monitor interval=30s ::= check every 30 seconds that this resource is working

# ... timeout = how long to wait before you assume a resource is dead.
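
Putting those pieces together with an explicit timeout, the full command might look like this (the resource name and values are illustrative, not taken from our configuration):

crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 \
   params ip=192.168.85.3 cidr_netmask=32 \
   op monitor interval=30s timeout=20s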

How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:

crm ra classes

Based on the result, I looked at:

crm ra list ocf heartbeat

To find out what IPaddr2 parameters I needed, I used:

crm ra meta ocf:heartbeat:IPaddr2

Initial configuration guide

This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.

# The commands ultimately used to configure the high-availability (HA) servers:

# The beginning: make sure pacemaker is running on both hypatia and orestes:

/sbin/service pacemaker status
crm node status
crm resource status

# We'll configure STONITH later (see below)

crm configure property stonith-enabled=false

# Let's continue by entering the crm utility for short sessions. I'm going to 
# test groups of commands before I commit them. I omit the "crm configure show"
# and "crm status" commands that I frequently typed in, in order to see that 
# everything was correct. 

# I also omit the standard resource options
# (e.g., "... op monitor interval="20" timeout="40" depth="0"...) to make the
# commands look simpler. This particular option means to check that the
# resource is running every 20 seconds, and to declare that the monitor operation
# will generate an error if 40 seconds elapse without a response. You can see the
# complete list with "crm configure show".  
   
# DRBD is a service that synchronizes the hard drives between two machines.
# When one machine makes any change to the DRBD disk, the other machine 
# immediately duplicates that change on the block level. We have a dual-primary
# configuration, which means both machines can mount the DRBD disk at once.

# Start by entering the resource manager. 

crm

   # Define a "shadow" configuration, to test things without committing them
   # to the HA cluster:
   cib new drbd
   
   # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res
   
   primitive AdminDrbd ocf:linbit:drbd \
                params drbd_resource="admin" \
                meta target-role="Master"
   
   # The following resource defines how the DRBD resource (AdminDrbd) is to be
        # duplicated ("cloned") among the nodes. The parameters specify that there are
        # two copies, one on each node, and both can be the master.
   
   ms AdminClone AdminDrbd \
                meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" \
                notify="true" interleave="true"

        configure show

        # Looks good, so commit the change. 
   cib commit drbd
   quit
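
# If a test in a shadow configuration goes wrong, it can be thrown away instead
# of committed; a minimal sketch (assuming the crm shell in use supports these
# cib subcommands):

crm
   # Switch back to the live configuration, then delete the abandoned shadow.
   cib use live
   cib delete drbd
   quit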

# Now define resources that depend on ordering.
crm
   cib new disk
   
        # The DRBD is available to the system. The next step is to tell LVM
        # that the volume group ADMIN exists on the disk. 

   # To find out that there was a resource "ocf:heartbeat:LVM" that I could use,
   # I used the command:
   ra classes
   
   # Based on the result, I looked at:
   
   ra list ocf heartbeat
   
   # To find out what LVM parameters I needed, I used:
   
   ra meta ocf:heartbeat:LVM
   
   # All of the above led me to create the following resource configuration:
   
   primitive AdminLvm ocf:heartbeat:LVM \
      params volgrpname="ADMIN" 

   # After I set up the volume group, I want to mount the logical volumes
        # (partitions) within the volume group. Here's one of the partitions, /usr/nevis;
        # note that I begin all the filesystem resources with FS so they'll be next
        # to each other when I type "crm configure show".

   primitive FSUsrNevis ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" \
                options="defaults,noatime,nodiratime" 
      
   # I have similar definitions for the other logical volumes in volume group ADMIN:
        # /mail, /var/nevis, etc. 
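
   # As an illustration only, a definition for /var/nevis following the same
   # pattern might look like this (the device path is an assumption; it must
   # match the actual logical-volume name shown by "lvs"):

   primitive FSVarNevis ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ADMIN-var" directory="/var/nevis" fstype="gfs2" \
      options="defaults,noatime,nodiratime"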
   
        # Now I'm going to define a resource group. The following command means:
        #    - Put all these resources on the same node;
        #    - Start these resources in the order they're listed;
        #    - The resources depend on each other in the order they're listed. For example,
        #       if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running.

       group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork

        # We want these logical volumes (or partitions or filesystems) to be available
        # on both nodes. To do this, we define a clone resource. 

        clone FilesystemClone FilesystemGroup meta interleave="true"
   
   # One more thing: It's important that we not try to set up the filesystems
        # until the DRBD admin resource is running on a node, and has been
        # promoted to master. 
   
   # A score of "inf" means "infinity"; if the DRBD resource 'AdminClone' can't
   # be promoted, then don't start the 'FilesystemClone' resource. 
   
   colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
        order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
   
   cib commit disk
   quit
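
# At this point each GFS2 filesystem should be mounted on both nodes; a quick
# hedged check (run on both hypatia and orestes):

mount -t gfs2
crm resource status FilesystemClone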

# Some standard Linux services are under corosync's control. They depend on some or
# all of the filesystems being mounted. 
   
# Let's start with a simple one: enable the printing service (cups):

crm
   cib new printing
   
   # lsb = "Linux Standard Base." It just means any service that is
   # controlled by one of the standard scripts in /etc/init.d.
   
   configure primitive Cups lsb:cups
   
   # Cups stores its spool files in /var/spool/cups. If the cups service 
   # were to switch to a different server, we want the new server to see 
   # the spooled files. So create /var/nevis/cups, link it with:
   #   mv /var/spool/cups /var/spool/cups.ori
   #   ln -sf /var/nevis/cups /var/spool/cups
   # and demand that the cups service only start if /var/nevis (and the other
   # high-availability directories) have been mounted.
   
   configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup
   
   # In order to prevent chaos, make sure that the high-availability directories
   # have been mounted before we try to start cups.
   
   configure order VarBeforeCups inf: AdminDirectoriesGroup Cups
   
   cib commit printing
   quit

# The other services (xinetd, dhcpd) follow the same pattern as above:
# Make sure the services start on the same machine as the admin directories,
# and after the admin directories are successfully mounted.

crm
   cib new services
   
   configure primitive Xinetd lsb:xinetd
   configure primitive Dhcpd lsb:dhcpd
   
   configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup
   configure order VarBeforeXinetd inf: VarDirectory Xinetd
   
   configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup
   configure order VarBeforeDhcpd inf: VarDirectory Dhcpd
   
   cib commit services
   quit

# The high-availability servers export some of the admin directories to other
# systems, both real and virtual; for example, the /usr/nevis directory is 
# exported to all the other machines on the Nevis Linux cluster. 

# NFS exporting of a shared directory can be a little tricky. As with CUPS 
# spooling, we want to preserve the NFS export state in a way that the 
# backup server can pick it up. The safest way to do this is to create a 
# small separate LVM partition ("nfs") and mount it as "/var/lib/nfs",
# the NFS directory that contains files that keep track of the NFS state.
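
# For reference, such a partition would be created with the same tools as the
# other logical volumes; a hypothetical sketch (the size is a guess, and the
# device path must match the one used in the NfsStateDirectory primitive below):

lvcreate -L 1G -n nfs ADMIN
mkfs.ext4 /dev/ADMIN/nfs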

crm
   cib new nfs
   
   # Define the mount for the NFS state directory /var/lib/nfs
   
   configure primitive NfsStateDirectory ocf:heartbeat:Filesystem \
         params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4"
   configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
   configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory

   # Now that the NFS state directory is mounted, we can start nfslockd. Note
   # that we're starting NFS lock on both the primary and secondary HA systems;
   # by default a "clone" resource is started on all systems in a cluster. 

   # (Placing nfslockd under the control of Pacemaker turned out to be key to
   # successful transfer of cluster services to another node. The nfslockd and
   # nfs daemon information stored in /var/lib/nfs has to be consistent.)

   configure primitive NfsLockInstance lsb:nfslock
   configure clone NfsLock NfsLockInstance

   configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock

   # Once nfslockd has been set up, we can start NFS. (We say to colocate
   # NFS with 'NfsStateDirectory', instead of nfslockd, because nfslockd
   # is going to be started on both nodes.)
   
   configure primitive Nfs lsb:nfs
   configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory
   configure order NfsLockBeforeNfs inf: NfsLock Nfs 
   
   cib commit nfs
   quit
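
# Once this is committed, the exports can be checked on whichever node is
# currently running the Nfs resource; a brief hedged check:

# List the directories being exported, with their options.
exportfs -v
# The same information as seen by an NFS client.
showmount -e localhost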

   
# The whole point of this setup is to be able to run guest virtual machines
# under the control of the high-availability service. Here is the set-up for one example
# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.

# I duplicated the same procedure for franklin (mail server), ada (web server), and
# so on, but I don't show that here.

crm
   cib new hogwarts
   
   # Give the virtual machine a long stop timeout before flagging an error.
   # Sometimes it takes a while for Linux to shut down.
   
   configure primitive Hogwarts ocf:heartbeat:Xen params \
      xmfile="/xen/configs/Hogwarts.cfg" \
         op stop interval="0" timeout="240"

   # All the virtual machine files are stored in the /xen partition, which is one
   # of the high-availability admin directories. The virtual machine must run on
   # the system with this directory.

   configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup

   # All of the virtual machines depend on NFS-mounting directories which
   # are exported by the HA server. The safest thing to do is to make sure
   # NFS is running on the HA server before starting the virtual machine.
   
   configure order NfsBeforeHogwarts inf: Nfs Hogwarts
   
   cib commit hogwarts
   quit
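
# A brief check that the virtual machine actually came up under Pacemaker's
# control (a sketch; "xm list" assumes the classic Xen toolstack is in use):

crm resource status Hogwarts
xm list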


# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed. 

# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.

# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.

# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power; it will never come back up.)

# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.

# The official corosync distribution from <http://www.clusterlabs.org/> 
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links
# to this script from the stonith/external directory. 
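
# To confirm that pacemaker can see the plugin, and to check which parameters it
# accepts, the same introspection commands used earlier work for the stonith class:

crm ra list stonith
crm ra meta stonith:external/nut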

# The following commands implement the STONITH mechanism for our cluster:

crm
   cib new stonith
   
   # The STONITH resource that can potentially shut down hypatia.
   
   configure primitive HypatiaStonith stonith:external/nut \
      params hostname="hypatia.nevis.columbia.edu" \
      ups="hypatia-ups" username="admin" password="acdc"
      
   # The node that runs the above script cannot be hypatia; it's
   # not wise to trust a node to STONITH itself. Note that the score
   # is "negative infinity," which means "never run this resource
   # on the named node."

   configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu

   # The STONITH resource that can potentially shut down orestes.

   configure primitive OrestesStonith stonith:external/nut \
      params hostname="orestes.nevis.columbia.edu" \
      ups="orestes-ups" username="admin" password="acdc"

   # Again, orestes cannot be the node that runs the above script.
   
   configure location OrestesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu
   
   cib commit stonith
   quit

# Now turn the STONITH mechanism on for the cluster.

crm configure property stonith-enabled=true
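
# A final hedged check: each STONITH resource should be running, and on the
# opposite node from the one it is able to shut down:

crm resource status HypatiaStonith
crm resource status OrestesStonith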

