Difference: PacemakerDualPrimaryConfiguration (1 vs. 11)

Revision 11 - 2014-07-01 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 6 to 6
 
Added:
>
>
Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, yet many hours of downtime due to problems with the HA software itself. This mailing-list post has some details.
 This is a reference page. It contains a text file that describes how the high-availability pacemaker configuration was set up on two administrative servers, hypatia and orestes.

Files

Revision 10 - 2013-07-01 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 151 to 151
 

Initial configuration guide

Changed:
<
<
This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/fRJMAYa6.
>
>
This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/QcxuvfK0.
 
# The commands ultimately used to configure the high-availability (HA) servers:

Revision 9 - 2013-01-29 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 98 to 98
 

Commands

Changed:
<
<
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:
>
>
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:
 
crm configure show

Revision 8 - 2013-01-09 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 78 to 78
 
    • Change the filter line to search for DRBD partitions:
      filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]
    • For lvm locking:
      locking_type = 3
Changed:
<
<
  • Edit /etc/sysconfig/cman to disable quorum (because there's only two nodes on the cluster):
    sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman
>
>
  • Edit /etc/sysconfig/cman to disable quorum (because there's only two nodes on the cluster):
    sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman
 
Changed:
<
<
  • Create a physical volume and a clustered volume group on the DRBD partition:
    pvcreate /dev/drbd0
>
>
  • Create a physical volume and a clustered volume group on the DRBD partition. The name of the DRBD disk is /dev/drbd0; the name of the volume group is ADMIN.
    pvcreate /dev/drbd0
 vgcreate -c y ADMIN /dev/drbd0
Changed:
<
<
  • For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example:
    lvcreate -L 200G -n usr ADMIN # ... and so on
>
>
  • For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example, the following creates a logical volume usr within the volume group ADMIN:
    lvcreate -L 200G -n usr ADMIN 
 mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
      Note that Nevis_HA is the cluster name defined in /etc/cluster/cluster.conf.

  • Make sure that cman, clvm2, and pacemaker daemons will start at boot; on both nodes, do:
    /sbin/chkconfig cman on

Revision 7 - 2013-01-07 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 151 to 151
 

Initial configuration guide

Changed:
<
<
This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.
>
>
This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/fRJMAYa6.
 
# The commands ultimately used to configure the high-availability (HA) servers:
Line: 562 to 562
 
Added:
>
>
Again, the final configuration that results from the above commands is on pastebin.
 
META TOPICMOVED by="WilliamSeligman" date="1348092384" from="Nevis.CorosyncDualPrimaryConfiguration" to="Nevis.PacemakerDualPrimaryConfiguration"

Revision 6 - 2013-01-04 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 121 to 121
 This may help as you work your way through the configuration:

crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
Changed:
<
<
cidr_netmask=32 op monitor interval=30s
>
>
cidr_netmask=32 op monitor interval=30s timeout=60s
  # Which is composed of
    * crm ::= "cluster resource manager", the command we're executing
Line: 133 to 133
  * cidr_netmask ::= netmask; 32-bits means use this exact IP address
  * op ::= what follows are options
  * monitor interval=30s ::= check every 30 seconds that this resource is working
Changed:
<
<
# ... timeout = how to long wait before you assume a resource is dead.
>
>
* timeout ::= how long to wait before you assume an "op" is dead.
 

How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:

Line: 172 to 171
 # and "crm status" commands that I frequently typed in, in order to see that # everything was correct.
Changed:
<
<
# I also omit the standard resource options
>
>
# I also omit some of the standard resource options
 # (e.g., "... op monitor interval="20" timeout="40" depth="0"...) to make the
Changed:
<
<
# commands look simpler. This particular option means to check that the # resource is running every 20 seconds, and to declare that the monitor operation # will generate an error if 40 seconds elapse without a response. You can see the
>
>
# commands look simpler. You can see the
 # complete list with "crm configure show".

# DRBD is a service that synchronizes the hard drives between two machines.

Line: 216 to 213
 crm cib new disk

Changed:
<
<
# The DRBD is available to the system. The next step is to tell LVM
>
>
# The DRBD disk is available to the system. The next step is to tell LVM
  # that the volume group ADMIN exists on the disk.

# To find out that there was a resource "ocf:heartbeat:LVM" that I could use,

Line: 261 to 258
  clone FilesystemClone FilesystemGroup meta interleave="true"

Changed:
<
<
# One more thing: It's important that we not try to set up the filesystems
>
>
# It's important that we not try to set up the filesystems
  # until the DRBD admin resource is running on a node, and has been
  # promoted to master.

Changed:
<
<
# A score of "inf" means "infinity"; if the DRBD resource 'AdminClone' can't # be promoted, then don't start the 'FilesystemClone' resource.
>
>
# A score of "inf:" means "infinity": 'FileSystemClone' must be on a node on which # 'AdminClone' is in the Master state; if the DRBD resource 'AdminClone' can't # be promoted, then don't start the 'FilesystemClone' resource. (You can use numeric # values instead of infinity, in which case these constraints become suggestions # instead of being mandatory.)
  colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
  order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
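# As a contrast to the mandatory "inf:" constraints above, here is a sketch of
# an advisory constraint with a numeric score; "SomeResource" is just a
# placeholder name and this line is not part of the committed configuration.
# A score of 500 means "prefer to keep SomeResource where AdminClone is Master,
# but it's not required":
#
#   colocation SomeResource_Prefers_Admin 500: SomeResource AdminClone:Master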
Line: 274 to 274
  cib commit disk quit
Changed:
<
<
# Some standard Linux services are under corosync's control. They depend on some or # all of the filesystems being mounted.

# Let's start with a simple one: enable the printing service (cups):

>
>
# Once all the filesystems are mounted, we can start other resources. Let's
# define a set of cloned IP addresses that will always point to at least one of the nodes,
# possibly both.
  crm
Changed:
<
<
cib new printing
>
>
cib new ip

# One address for each network

 
Changed:
<
<
# lsb = "Linux Standard Base." It just means any service which is # controlled by the one of the standard scripts in /etc/init.d
>
>
primitive IP_cluster ocf:heartbeat:IPaddr2 params ip="129.236.252.11" cidr_netmask="32" nic="eth0"
primitive IP_cluster_local ocf:heartbeat:IPaddr2 params ip="10.44.7.11" cidr_netmask="32" nic="eth2"
primitive IP_cluster_sandbox ocf:heartbeat:IPaddr2 params ip="10.43.7.11" cidr_netmask="32" nic="eth0.3"
 
Changed:
<
<
configure primitive Cups lsb:cups
>
>
# Group them together
 
Changed:
<
<
# Cups stores its spool files in /var/spool/cups. If the cups service # were to switch to a different server, we want the new server to see # the spooled files. So create /var/nevis/cups, link it with: # mv /var/spool/cups /var/spool/cups.ori # ln -sf /var/nevis/cups /var/spool/cups # and demand that the cups service only start if /var/nevis (and the other # high-availability directories) have been mounted.
>
>
group IPGroup IP_cluster IP_cluster_local IP_cluster_sandbox
 
Changed:
<
<
configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup
>
>
# The option "globally-unique=true" works with IPTABLES to make # sure that ethernet connections are not disrupted even if one of # nodes goes down; see "Clusters From Scratch" for details.
 
Changed:
<
<
# In order to prevent chaos, make sure that the high-availability directories # have been mounted before we try to start cups.
>
>
clone IPClone IPGroup meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false"
 
Changed:
<
<
configure order VarBeforeCups inf: AdminDirectoriesGroup Cups
>
>
# Make sure the filesystems are mounted before starting the IP resources.
colocation IP_With_Filesystem inf: IPClone FilesystemClone
order Filesystem_Before_IP inf: FilesystemClone IPClone
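# A quick way to check where the cloned IP group actually ended up (a sketch
# added here for reference; crm_resource is part of the standard pacemaker
# tools, run from a root shell rather than inside crm):
#
#   crm_resource --resource IPClone --locate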
 
Changed:
<
<
cib commit printing
>
>
cib commit ip
  quit
Changed:
<
<
# The other services (xinetd, dhcpd) follow the same pattern as above: # Make sure the services start on the same machine as the admin directories, # and after the admin directories are successfully mounted.
>
>
# We have to export some of the filesystems via NFS before some of the virtual machines
# will be able to run.
  crm
Changed:
<
<
cib new services
>
>
cib new exports
 
Changed:
<
<
configure primitive Xinetd lsb:xinetd configure primitive Dhcpd lsb:dhcpd
>
>
# This is an example NFS export resource; I won't list them all here. See
# "crm configure show" for the complete list.

primitive ExportUsrNevis ocf:heartbeat:exportfs description="Site-wide applications installed in /usr/nevis" params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" options="ro,no_root_squash,async"

# Define a group for all the exportfs resources. You can see it's a long list,
# which is why I don't list them all explicitly. I had to be careful
# about the exportfs definitions; despite the locking mechanisms of GFS2,
# we'd get into trouble if two external systems tried to write to the same
# DRBD partition at once via NFS.

 
Changed:
<
<
configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup configure order VarBeforeXinetd inf: VarDirectory Xinetd
>
>
group ExportsGroup ExportMail ExportMailInbox ExportMailFolders ExportMailForward ExportMailProcmailrc ExportUsrNevisHermes ExportUsrNevis ExportUsrNevisOffsite ExportWWW
 
Changed:
<
<
configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup configure order VarBeforeDhcpd inf: VarDirectory Dhcpd
>
>
# Clone the group so both nodes export the partitions. Make sure the
# filesystems are mounted before we export them.
 
Changed:
<
<
cib commit services
>
>
clone ExportsClone ExportsGroup
colocation Exports_With_Filesystem inf: ExportsClone FilesystemClone
order Filesystem_Before_Exports inf: FilesystemClone ExportsClone

cib commit exports

  quit
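A quick sanity check once the exports clone is running (an addition to these notes, using only the standard NFS tooling): on each node, confirm that the kernel NFS server picked up the exportfs resources.

exportfs -v | grep /usr/nevis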
Changed:
<
<
# The high-availability servers export some of the admin directories to other # systems, both real and virtual; for example, the /usr/nevis directory is # exported to all the other machines on the Nevis Linux cluster.

# NFS exporting of a shared directory can be a little tricky. As with CUPS # spooling, we want to preserve the NFS export state in a way that the # backup server can pick it up. The safest way to do this is to create a # small separate LVM partition ("nfs") and mount it as "/var/lib/nfs", # the NFS directory that contains files that keep track of the NFS state.

>
>
# Symlinks: There are some scripts that I want to run under cron. These scripts are
# located in the DRBD /var/nevis file system. For them to run via cron, they have to be
# found in /etc/cron.d somehow. A symlink is the easiest way, and there's a
# symlink pacemaker resource to manage this.
  crm
Changed:
<
<
cib new nfs
>
>
cib new cron
 
Changed:
<
<
# Define the mount for the NFS state directory /var/lib/nfs
>
>
# The ambient-temperature script periodically checks the computer room's
# environment monitor, and shuts down the cluster if the temperature gets
# too high.

primitive CronAmbientTemperature ocf:heartbeat:symlink description="Shutdown cluster if A/C stops" params link="/etc/cron.d/ambient-temperature" target="/var/nevis/etc/cron.d/ambient-temperature" backup_suffix=".original"

# We don't want to clone this resource; I only want one system to run this script
# at any one time.

colocation Temperature_With_Filesystem inf: CronAmbientTemperature FilesystemClone
order Filesystem_Before_Temperature inf: FilesystemClone CronAmbientTemperature

# Every couple of months, make a backup of the virtual machine's disk images.

primitive CronBackupVirtualDiskImages ocf:heartbeat:symlink description="Periodically save copies of the virtual machines" params link="/etc/cron.d/backup-virtual-disk-images" target="/var/nevis/etc/cron.d/backup-virtual-disk-images" backup_suffix=".original"
colocation BackupImages_With_Filesystem inf: CronBackupVirtualDiskImages FilesystemClone
order Filesystem_Before_BackupImages inf: FilesystemClone CronBackupVirtualDiskImages
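# Sanity check (an addition to these notes, not a crm command): once the symlink
# resources are running, "ls -l" from a root shell on the active node should show
# each link pointing into the DRBD-backed /var/nevis tree, e.g.
#
#   ls -l /etc/cron.d/ambient-temperature /etc/cron.d/backup-virtual-disk-images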

 
Changed:
<
<
configure primitive NfsStateDirectory ocf:heartbeat:Filesystem params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4" configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory
>
>
cib commit cron
quit
 
Changed:
<
<
# Now that the NFS state directory is mounted, we can start nfslockd. Note that # that we're starting NFS lock on both the primary and secondary HA systems; # by default a "clone" resource is started on all systems in a cluster.
>
>
# These are the most important resources on the HA cluster: the virtual
# machines.
 
Changed:
<
<
# (Placing nfslockd under the control of Pacemaker turned out to be key to # successful transfer of cluster services to another node. The nfslockd and # nfs daemon information stored in /var/lib/nfs have to be consistent.)
>
>
crm cib new vm
 
Changed:
<
<
configure primitive NfsLockInstance lsb:nfslock configure clone NfsLock NfsLockInstance
>
>
# In order to start a virtual machine, the libvirtd daemon has to run. The "lsb:" means
# "Linux Standard Base", which in turn means any script located in
# /etc/init.d.
 
Changed:
<
<
configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock
>
>
primitive Libvirtd lsb:libvirtd
 
Changed:
<
<
# Once nfslockd has been set up, we can start NFS. (We say to colocate # NFS with 'NfsStateDirectory', instead of nfslockd, because nfslockd # is going to be started on both nodes.)
>
>
# libvirtd looks for configuration files that define the virtual machines.
# These files are kept in /var/nevis, like the above cron scripts, and are
# "placed" via symlinks.
 
Changed:
<
<
configure primitive Nfs lsb:nfs configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory configure order NfsLockBeforeNfs inf: NfsLock Nfs
>
>
primitive SymlinkEtcLibvirt ocf:heartbeat:symlink params link="/etc/libvirt" target="/var/nevis/etc/libvirt" backup_suffix=".original"
primitive SymlinkQemuSnapshot ocf:heartbeat:symlink params link="/var/lib/libvirt/qemu/snapshot" target="/var/nevis/lib/libvirt/qemu/snapshot" backup_suffix=".original"
 
Changed:
<
<
cib commit nfs quit
>
>
# Again, define a group for these resources, clone the group so they
# run on both nodes, and make sure they don't run unless the
# filesystems are mounted.
 
Added:
>
>
group LibvirtdGroup SymlinkEtcLibvirt SymlinkQemuSnapshot Libvirtd
clone LibvirtdClone LibvirtdGroup
colocation Libvirtd_With_Filesystem inf: LibvirtdClone FilesystemClone
 
Changed:
<
<
# The whole point of the entire setup is to be able to run guest virtual machines # under the control of the high-availability service. Here is the set-up for one example # virtual machine. I previously created the hogwarts virtual machine and copied its # configuration to /xen/configs/hogwarts.cfg.
>
>
# A tweak: some virtual machines require the directories exported
# by the exportfs resources defined above. Don't start the VMs until
# the exports are complete.
 
Changed:
<
<
# I duplicated the same procedure for franklin (mail server), ada (web server), and # so on, but I don't show that here.
>
>
order Exports_Before_Libvirtd inf: ExportsClone LibvirtdClone
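# Once LibvirtdClone is running (a check added here, not a crm command): from a
# root shell on either node the stock libvirt client should respond, listing the
# defined guests even if none are running yet:
#
#   virsh list --all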
 
Changed:
<
<
crm cib new hogwarts
>
>
# The typical definition of a resource that runs a VM. I won't list
# them all, just the one for the mail server. Note that all the
# virtual-machine resource names start with VM_, so they'll show
# up next to each other in the output of "crm configure show".
 
Changed:
<
<
# Give the virtual machine a long stop interval before flagging an error. # Sometimes it takes a while for Linux to shut down.
>
>
# VM migration is a neat feature. If pacemaker has the chance to move
# a virtual machine, it can transmit it to another node without stopping it
# on the source node and restarting it at the destination. If a machine
# crashes, migration can't happen, but it can greatly speed up the
# controlled shutdown of a node.
 
Changed:
<
<
configure primitive Hogwarts ocf:heartbeat:Xen params xmfile="/xen/configs/Hogwarts.cfg" op stop interval="0" timeout="240"
>
>
primitive VM_franklin ocf:heartbeat:VirtualDomain params config="/etc/libvirt/qemu/franklin.xml" \
   migration_transport="ssh" meta allow-migrate="true"
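# Usage sketch (added for reference, not part of the committed configuration):
# because allow-migrate="true" is set, a controlled move can be done later from
# a root shell with the crm resource commands; the node name here is an example.
#
#   crm resource migrate VM_franklin orestes.nevis.columbia.edu
#   crm resource unmigrate VM_franklin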
 
Changed:
<
<
# All the virtual machine files are stored in the /xen partition, which is one # of the high-availability admin directories. The virtual machine must run on # the system with this directory.
>
>
# We don't want to clone the VMs; it will just confuse things if there were
# two mail servers (with the same IP address!) running at the same time.
 
Changed:
<
<
configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup
>
>
colocation Mail_With_Libvirtd inf: VM_franklin LibvirtdClone
order Libvirtd_Before_Mail inf: LibvirtdClone VM_franklin
 
Changed:
<
<
# All of the virtual machines depend on NFS-mounting directories which # are exported by the HA server. The safest thing to do is to make sure # NFS is running on the HA server before starting the virtual machine.
>
>
cib commit vm
quit

# A less-critical resource is tftp. As above, we define the basic xinetd
# resource found in /etc/init.d, include a configuration file with a symlink,
# then clone the resource, and specify it can't run until the filesystems
# are mounted.

crm cib new tftp

 
Changed:
<
<
configure order NfsBeforeHogwarts inf: Nfs Hogwarts
>
>
primitive Xinetd lsb:xinetd
primitive SymlinkTftp ocf:heartbeat:symlink params link="/etc/xinetd.d/tftp" target="/var/nevis/etc/xinetd.d/tftp" backup_suffix=".original"

group TftpGroup SymlinkTftp Xinetd
clone TftpClone TftpGroup
colocation Tftp_With_Filesystem inf: TftpClone FilesystemClone
order Filesystem_Before_Tftp inf: FilesystemClone TftpClone

 
Changed:
<
<
cib commit hogwarts
>
>
cib commit tftp
  quit
Added:
>
>
# More important is dhcpd, which assigns IP addresses dynamically.
# Many systems at Nevis require a DHCP server for their IP address,
# including the wireless routers. This follows the same pattern as above,
# except that we don't clone the dhcpd daemon, since we want only
# one DHCP server at Nevis.

crm cib new dhcp

configure primitive Dhcpd lsb:dhcpd

# Associate an IP address with the DHCP server. This is a mild
# convenience for the times I update the list of MAC addresses
# to be assigned permanent IP addresses.

primitive IP_dhcp ocf:heartbeat:IPaddr2 params ip="10.44.107.11" cidr_netmask="32" nic="eth2"

primitive SymlinkDhcpdConf ocf:heartbeat:symlink params link="/etc/dhcp/dhcpd.conf" target="/var/nevis/etc/dhcpd.conf" backup_suffix=".original"
primitive SymlinkDhcpdLeases ocf:heartbeat:symlink params link="/var/lib/dhcpd" target="/var/nevis/dhcpd" backup_suffix=".original"
primitive SymlinkSysconfigDhcpd ocf:heartbeat:symlink params link="/etc/sysconfig/dhcpd" target="/var/nevis/etc/sysconfig/dhcpd" backup_suffix=".original"

group DhcpGroup SymlinkDhcpdConf SymlinkSysconfigDhcpd SymlinkDhcpdLeases Dhcpd IP_dhcp
colocation Dhcp_With_Filesystem inf: DhcpGroup FilesystemClone
order Filesystem_Before_Dhcp inf: FilesystemClone DhcpGroup

cib commit dhcp
quit

  # An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
Line: 429 to 518
  # The official corosync distribution from <http://www.clusterlabs.org/>
# does not include a script for NUT, so I had to write one. It's located at
Changed:
<
<
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links # to this script from the stonith/external directory.
>
>
# /home/bin/fence_nut.pl on both hypatia and orestes; there are appropriate links
# to this script from /usr/sbin/fence_nut.
  # The following commands implement the STONITH mechanism for our cluster:
Line: 439 to 528
  # The STONITH resource that can potentially shut down hypatia.

Changed:
<
<
configure primitive HypatiaStonith stonith:external/nut params hostname="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="admin" password="acdc"
>
>
primitive StonithHypatia stonith:fence_nut params stonith-timeout="120s" pcmk_host_check="static-list" pcmk_host_list="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="XXXX" password="XXXX" cycledelay="20" ondelay="20" offdelay="20" noverifyonoff="1" debug="1"
  # The node that runs the above script cannot be hypatia; it's
# not wise to trust a node to STONITH itself. Note that the score
# is "negative infinity," which means "never run this resource
# on the named node."
Changed:
<
<
configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu
>
>
location StonithHypatia_Location StonithHypatia -inf: hypatia.nevis.columbia.edu
  # The STONITH resource that can potentially shut down orestes.
Changed:
<
<
configure primitive OrestesStonith stonith:external/nut params hostname="orestes.nevis.columbia.edu" ups="orestes-ups" username="admin" password="acdc"
>
>
primitive StonithOrestes stonith:fence_nut params stonith-timeout="120s" pcmk_host_check="static-list" pcmk_host_list="orestes.nevis.columbia.edu" ups="orestes-ups" username="XXXX" password="XXXX" cycledelay="20" ondelay="20" offdelay="20" noverifyonoff="1" debug="1"
  # Again, orestes cannot be the node that runs the above script.

Changed:
<
<
configure location OresetesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu
>
>
location StonithOrestes_Location StonithOrestes -inf: orestes.nevis.columbia.edu
  cib commit stonith
  quit
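Once stonith-enabled is turned back on, a quick check that both fence resources registered (an addition to these notes; stonith_admin is part of the standard pacemaker tools):

stonith_admin --list-registered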

Revision 5 - 2013-01-04 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 21 to 21
 /home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)
Changed:
<
<
The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://www.pastbin.com and searching for wgseligman 20130103.
>
>
The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://pastebin.com/u/wgseligman and searching for 20130103.
 

One-time set-up

Line: 72 to 72
 

Clustered LVM setup

Changed:
<
<
The following commands only have to be issued on one of the nodes.
>
>
Most of the following commands only have to be issued on one of the nodes. See Clusters From Scratch and Redhat Cluster Tutorial for details.
 
  • Edit /etc/lvm/lvm.conf on both systems; search this file for the initials WGS for a complete list of changes.
    • Change the filter line to search for DRBD partitions:
      filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]
Line: 94 to 94
 
  • Reboot both nodes.
Changed:
<
<

Commands

>
>

Pacemaker configuration

Commands

  The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:
crm configure show
Added:
>
>
To see the status of all the resources:
crm resource status
 To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
crm_mon
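If you only want a one-time snapshot instead of a continuously-updating display (a small convenience noted here; the flag is standard in crm_mon):

crm_mon -1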
Line: 110 to 116
 sudo crm_mon
Changed:
<
<

Concepts

>
>

Concepts

  This may help as you work your way through the configuration:
Changed:
<
<
crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
>
>
crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
  cidr_netmask=32 op monitor interval=30s

# Which is composed of
    * crm ::= "cluster resource manager", the command we're executing
    * primitive ::= The type of resource object that we’re creating.

Changed:
<
<
* IP ::= Our name for the resource
>
>
* MyIPResource ::= Our name for the resource
  * IPaddr2 ::= The script to call
  * ocf ::= The standard it conforms to
  * ip=192.168.85.3 ::= Parameter(s) as name/value pairs
Line: 144 to 150
 crm ra meta ocf:heartbeat:IPaddr2
Changed:
<
<

Configuration

>
>

Initial configuration guide

 
Changed:
<
<
This work was done in Sep-2010, with major revisions for stability in Aug-2011. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.
>
>
This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.
 
# The commands ultimately used to configure the high-availability (HA) servers:
Changed:
<
<
# The beginning: make sure corosync is running on both hypatia and orestes:

/sbin/service corosync start

# The following line is needed because we have only two machines in # the HA cluster.

>
>
# The beginning: make sure pacemaker is running on both hypatia and orestes:
 
Changed:
<
<
crm configure property no-quorum-policy=ignore
>
>
/sbin/service pacemaker status
crm node status
crm resource status
  # We'll configure STONITH later (see below)

crm configure property stonith-enabled=false

Deleted:
<
<
# Define IP addresses to be managed by the HA systems.

crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=129.236.252.11 cidr_netmask=32 op monitor interval=30s crm configure primitive LocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.11 cidr_netmask=32 op monitor interval=30s crm configure primitive SandboxIP ocf:heartbeat:IPaddr2 params ip=10.43.7.11 cidr_netmask=32 op monitor interval=30s

# Group these together, so they'll all be assigned to the same machine. # The name of the group is "MainIPGroup".

crm configure group MainIPGroup ClusterIP LocalIP SandboxIP

 # Let's continue by entering the crm utility for short sessions. I'm going to
Changed:
<
<
# test groups of commands before I commit them. (I omit the "configure show' # and "status" commands that I frequently typed in, in order to see that # everything was correct.)
>
>
# test groups of commands before I commit them. I omit the "crm configure show"
# and "crm status" commands that I frequently typed in, in order to see that
# everything was correct.

# I also omit the standard resource options
# (e.g., "... op monitor interval="20" timeout="40" depth="0"...) to make the
# commands look simpler. This particular option means to check that the
# resource is running every 20 seconds, and to declare that the monitor operation
# will generate an error if 40 seconds elapse without a response. You can see the
# complete list with "crm configure show".

  # DRBD is a service that synchronizes the hard drives between two machines.
Changed:
<
<
# For our cluster, one machine will have access to the "master" copy # and make all the changes to that copy; the other machine will have the # "slave" copy and mindlessly duplicate all the changes.

# I previously configured the DRBD resources 'admin' and 'work'. What the # following commands do is put the maintenance of these resources under # the control of Pacemaker.

>
>
# When one machine makes any change to the DRBD disk, the other machine
# immediately duplicates that change on the block level. We have a dual-primary
# configuration, which means both machines can mount the DRBD disk at once.

# Start by entering the resource manager.

  crm
Added:
>
>
  # Define a "shadow" configuration, to test things without committing them # to the HA cluster: cib new drbd
Line: 197 to 192
  # to the HA cluster: cib new drbd

Changed:
<
<
# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/

configure primitive AdminDrbd ocf:linbit:drbd params drbd_resource=admin op monitor interval=60s

# DRBD functions with a "master/slave" setup as described above. The following command # defines the name of the master disk partition ("Admin"). The remaining parameters # clarify that there are two copies, but only one can be the master, and # at most one can be a slave.

configure master Admin AdminDrbd meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false

# The machine that gets the master copy (the one that will make changes to the drive) # should also be the one with the main IP address.

>
>
# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res
 
Changed:
<
<
configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master
>
>
primitive AdminDrbd ocf:linbit:drbd params drbd_resource="admin" meta target-role="Master"

# The following resource defines how the DRBD resource (AdminDrbd) is to be
# duplicated ("cloned") among the nodes. The parameters clarify that there are
# two copies, one on each node, and both can be the master.

ms AdminClone AdminDrbd meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"

 
Changed:
<
<
# We want to wait before assigning IPs to a node until we know that # Admin has been promoted to master on that node. configure order AdminBeforeMainIP inf: Admin:promote MainIPGroup

# I like these commands, so commit them to the running configuration.

cib commit drbd

# Things look good, so let's add another disk resource. I defined another drbd resource # with some spare disk space, called "work". The idea is that I can play with alternate # virtual machines and save them on "work" before I copy them to the more robust "admin".

configure primitive WorkDrbd ocf:linbit:drbd params drbd_resource=work op monitor interval=60s configure master Work WorkDrbd meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false

# I prefer the work directory to be on the main admin box, but it doesn't have to be. "500:" is # weighting factor; compare it to "inf:" (for infinity) which is used in most of these commands.

configure colocation WorkPrefersMain 500: Work:Master MainIPGroup

# Given a choice, try to put the Admin:Master on hypatia

configure location DefinePreferredMainNode Admin 100: hypatia.nevis.columbia.edu

>
>
configure show
 
Added:
>
>
# Looks good, so commit the change.
  cib commit drbd
  quit
Changed:
<
<
# Now try a resource that depends on ordering: On the node that has the master # resource for "work," mount that disk image as /work.
>
>
# Now define resources that depend on ordering.
 crm
Changed:
<
<
cib new workdisk
>
>
cib new disk

# The DRBD is available to the system. The next step is to tell LVM # that the volume group ADMIN exists on the disk.

 
Changed:
<
<
# To find out that there was an "ocf:heartbeat:Filesystem" that I could use,
>
>
# To find out that there was a resource "ocf:heartbeat:LVM" that I could use,
  # I used the command: ra classes

Line: 255 to 227
  ra list ocf heartbeat

Changed:
<
<
# To find out what Filesystem parameters I needed, I used:
>
>
# To find out what LVM parameters I needed, I used:
 
Changed:
<
<
ra meta ocf:heartbeat:Filesystem
>
>
ra meta ocf:heartbeat:LVM
  # All of the above led me to create the following resource configuration:

Changed:
<
<
configure primitive WorkDirectory ocf:heartbeat:Filesystem params device="/dev/drbd2" directory="/work" fstype="ext4"
>
>
primitive AdminLvm ocf:heartbeat:LVM params volgrpname="ADMIN"
 
Changed:
<
<
# Note that I had previously created an ext4 filesystem on /dev/drbd2.
>
>
# After I set up the volume group, I want to mount the logical volumes
# (partitions) within the volume group. Here's one of the partitions, /usr/nevis;
# note that I begin all the filesystem resources with FS so they'll be next
# to each other when I type "crm configure show".
 
Changed:
<
<
# Now specify that we want this to be on the same node as Work:Master:
>
>
primitive FSUsrNevis ocf:heartbeat:Filesystem params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" options="defaults,noatime,nodiratime"
 
Changed:
<
<
configure colocation DirectoryWithWork inf: WorkDirectory Work:Master
>
>
# I have similar definitions for the other logical volumes in volume group ADMIN:
# /mail, /var/nevis, etc.
 
Changed:
<
<
# One more thing: It's important that we not try to mount the directory # until after Work has been promoted to master on the node.
>
>
# Now I'm going to define a resource group. The following command means:
#   - Put all these resources on the same node;
#   - Start these resources in the order they're listed;
#   - The resources depend on each other in the order they're listed. For example,
#     if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running.
 
Changed:
<
<
# A score of "inf" means "infinity"; if the DRBD resource 'work' can't # be set up, then don't mount the /work partition.
>
>
group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork
 
Changed:
<
<
configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start
>
>
# We want these logical volumes (or partitions or filesystems) to be available
# on both nodes. To do this, we define a clone resource.
 
Changed:
<
<
cib commit workdisk quit

# We've made the relatively-unimportant DRBD resource 'work' function. Let's do it for 'admin'. # Previously I created some LVM volumes on the admin DRBD master. We need to use a # resource to active them, but we can't activate them until after the Admin:Master # is loaded. crm cib new lvm

>
>
clone FilesystemClone FilesystemGroup meta interleave="true"
 
Changed:
<
<
# Activate the LVM volumes, but only after DRBD has figured out where # Admin:Master is located.
>
>
# One more thing: It's important that we not try to set up the filesystems
# until the DRBD admin resource is running on a node, and has been
# promoted to master.
 
Changed:
<
<
configure primitive Lvm ocf:heartbeat:LVM params volgrpname="admin" configure colocation LvmWithAdmin inf: Lvm Admin:Master configure order AdminBeforeLvm inf: Admin:promote Lvm:start
>
>
# A score of "inf" means "infinity"; if the DRBD resource 'AdminClone' can't # be promoted, then don't start the 'FilesystemClone' resource.
 
Changed:
<
<
cib commit lvm
>
>
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
 
Changed:
<
<
# Go back to the actual, live configuration

cib use live

# See if everything is working

configure show status

# Go back to the shadow for more commands.

cib use lvm

# We have a whole bunch of filesystems on the "admin" volume group. Let's # create the commands to mount them.

# The 'timeout="240s' piece is to give a four-minute interval to start # up the mount. This allows for a "it's been too long, do an fsck" check # on mounting the filesystem.

# We also allow five minutes for the unmounting to stop, just in case # it's taking a while for some job on server to let go of the mount. # It's better that it take a while to switch over the system service # than for the mount to be forcibly terminated.

configure primitive UsrDirectory ocf:heartbeat:Filesystem params device="/dev/admin/usr" directory="/usr/nevis" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s"

configure primitive VarDirectory ocf:heartbeat:Filesystem params device="/dev/admin/var" directory="/var/nevis" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s"

configure primitive MailDirectory ocf:heartbeat:Filesystem params device="/dev/admin/mail" directory="/mail" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s"

configure primitive XenDirectory ocf:heartbeat:Filesystem params device="/dev/admin/xen" directory="/xen" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s"

configure group AdminDirectoriesGroup UsrDirectory VarDirectory MailDirectory XenDirectory

# We can't mount any of them until LVM is set up:

configure colocation DirectoriesWithLVM inf: AdminDirectoriesGroup Lvm configure order LvmBeforeDirectories inf: Lvm AdminDirectoriesGroup

cib commit lvm

>
>
cib commit disk
  quit

# Some standard Linux services are under corosync's control. They depend on some or

Line: 511 to 432
 # /home/bin/nut.sh on both hypatia and orestes; there are appropriate links # to this script from the stonith/external directory.
Deleted:
<
<
# By the way, I sent the script to Cluster Labs, who accepted it. # The next generation of their distribution will include the script.
 # The following commands implement the STONITH mechanism for our cluster:

crm

Line: 549 to 467
  crm configure property stonith-enabled=true
Deleted:
<
<
# At this point, the key elements of the high-availability configuration have # been set up. There is one non-critical frill: One node (probably hypatia) will be # running the important services, while the other node (probably orestes) would # be "twiddling its thumbs." Instead, let's have orestes do something useful: execute # condor jobs.

# For orestes to do this, it requires the condor service. It also requires that # library:/usr/nevis is mounted, the same as every other batch machine on the # Nevis condor cluster. We can't use the automount daemon (amd) to do this for # us, the way we do on the other batch nodes; we have to make corosync do the # mounts.

crm cib new condor

# Mount library:/usr/nevis. A bit of a name confusion here: there's a /work # partition on the primary node, but the name 'LibraryOnWork" means that # the nfs-mount of /usr/nevis is located on the secondary or "work" node.

configure primitive LibraryOnWork ocf:heartbeat:Filesystem params device="library:/usr/nevis" directory="/usr/nevis" fstype="nfs"

# Corosync must not NFS-mount library:/usr/nevis on the system has already # mounted /usr/nevis directly as part of AdminDirectoriesGroup # described above.

# Note that if there's only one node remaining in the high-availability # cluster, it will be running the resource AdminDirectoriesGroup, and # LibraryOnWork will never be started. This is fine; if there's only one # node left, I don't want it running condor jobs.

configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup

# Determine on which machine we mount library:/usr/nevis after the NFS # export of /usr/nevis has been set up.

configure order NfsBeforeLibrary inf: Nfs LibraryOnWork

# Define the IPs associated with the backup system, and group them together. # This is a non-critical definition, and I don't want to assign it until the more important # "secondary" resources have been set up.

configure primitive Burr ocf:heartbeat:IPaddr2 params ip=129.236.252.10 cidr_netmask=32 op monitor interval=30s configure primitive BurrLocal ocf:heartbeat:IPaddr2 params ip=10.44.7.10 cidr_netmask=32 op monitor interval=30s configure group AssistantIPGroup Burr BurrLocal

colocation AssistantWithLibrary inf: AssistantIPGroup LibraryOnWork order LibraryBeforeAssistant inf: LibraryOnWork AssistantIPGroup

# The standard condor execution service. As with all the batch nodes, # I've already configured /etc/condor/condor_config.local and created # scratch directories in /data/condor.

configure primitive Condor lsb:condor

# If we're able mount library:/usr/nevis, then it's safe to start condor. # If we can't mount library:/usr/nevis, then condor will never be started. # (We stated above that AssistantIPGroup won't start until after LibraryOnWork).

configure colocation CondorWithAssistant inf: Condor AssistantIPGroup configure order AssistantBeforeCondor inf: AssistantIPGroup Condor

cib commit condor quit

 

META TOPICMOVED by="WilliamSeligman" date="1348092384" from="Nevis.CorosyncDualPrimaryConfiguration" to="Nevis.PacemakerDualPrimaryConfiguration"

Revision 4 - 2013-01-03 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 10 to 10
 

Files

Changed:
<
<
Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files:
/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
/etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)
>
>
Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files!

/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
/etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)

The links are to an external site, pastebin; I use this in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting http://www.pastbin.com and searching for wgseligman 20130103.

 

One-time set-up

Line: 30 to 31
 

DRBD set-up

Changed:
<
<
Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia:
>
>
Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia:
 
/sbin/drbdadm create-md admin
/sbin/service drbd start
Line: 69 to 70
  Here's a guide to understanding the contents of /proc/drbd.
Added:
>
>

Clustered LVM setup

The following commands only have to be issued on one of the nodes.

  • Edit /etc/lvm/lvm.conf on both systems; search this file for the initials WGS for a complete list of changes.
    • Change the filter line to search for DRBD partitions:
      filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]
    • For lvm locking:
      locking_type = 3

  • Edit /etc/sysconfig/cman to disable quorum (because there's only two nodes on the cluster):
    sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman

  • Create a physical volume and a clustered volume group on the DRBD partition:
    pvcreate /dev/drbd0
    vgcreate -c y ADMIN /dev/drbd0

  • For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example:
    lvcreate -L 200G -n usr ADMIN # ... and so on
    mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
    Note that Nevis_HA is the cluster name defined in /etc/cluster/cluster.conf.

  • Make sure that cman, clvm2, and pacemaker daemons will start at boot; on both nodes, do:
    /sbin/chkconfig cman on
    /sbin/chkconfig clvmd on
    /sbin/chkconfig pacemaker on

  • Reboot both nodes.
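After the reboot, a few generic commands (added here as a sketch; nothing below is specific to this cluster) confirm that the clustered volume group and its GFS2 filesystems came back:

vgs ADMIN
lvs ADMIN
mount -t gfs2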
 

Commands

The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

Revision 3 - 2012-09-20 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

Line: 19 to 19
 /home/bin/fence_nut.pl /etc/rc.d/rc.local /home/bin/recover-symlinks.sh
Added:
>
>
/etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)
 

One-time set-up

Revision 2 - 2012-09-20 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"
Changed:
<
<

Nevis particle-physics administrative cluster configuration

>
>

Nevis particle-physics administrative cluster configuration

  This is a reference page. It contains a text file that describes how the high-availability pacemaker configuration was set up on two administrative servers, hypatia and orestes.
Line: 13 to 17
 /etc/drbd.d/global_common.conf /etc/drbd.d/admin.res /home/bin/fence_nut.pl
Added:
>
>
/etc/rc.d/rc.local /home/bin/recover-symlinks.sh

One-time set-up

The commands to set up a dual-primary cluster are outlined here. The details can be found in Clusters From Scratch and Redhat Cluster Tutorial.

Warning: Do not type any of these commands in the hopes of fixing a problem! They will erase the shared DRBD drive.

DRBD set-up

Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia:

/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin

Then on orestes:

/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin
 
Added:
>
>
Back to hypatia:
/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd

Keep looking at the contents of /proc/drbd. It will take a while, but eventually the two disks will sync up.

Back to orestes:

/sbin/drbdadm primary admin
cat /proc/drbd

The result should be something like this:

# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Here's a guide to understanding the contents of /proc/drbd.
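Besides reading /proc/drbd directly, drbdadm can report the same information per resource (a convenience noted here; these are standard DRBD 8.4 subcommands). In the final dual-primary state they should report Connected, UpToDate/UpToDate, and Primary/Primary:

/sbin/drbdadm cstate admin
/sbin/drbdadm dstate admin
/sbin/drbdadm role admin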

 

Commands

The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

Revision 1 - 2012-09-19 - WilliamSeligman

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="Computing"

Nevis particle-physics administrative cluster configuration

This is a reference page. It contains a text file that describes how the high-availability pacemaker configuration was set up on two administrative servers, hypatia and orestes.

Files

Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files:

/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl

Commands

The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

crm configure show
To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
crm_mon
You can run the above commands via sudo, but you'll have to extend your path; e.g.,
export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon

Concepts

This may help as you work your way through the configuration:

crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
   cidr_netmask=32 op monitor interval=30s

# Which is composed of
    * crm ::= "cluster resource manager", the command we're executing
    * primitive ::= The type of resource object that we’re creating.
    * IP ::= Our name for the resource
    * IPaddr2 ::= The script to call
    * ocf ::= The standard it conforms to
    * ip=192.168.85.3 ::= Parameter(s) as name/value pairs
    * cidr_netmask ::= netmask; 32-bits means use this exact IP address
    * op ::= what follows are options
    * monitor interval=30s ::= check every 30 seconds that this resource is working

# ... timeout = how to long wait before you assume a resource is dead. 

How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:

crm ra classes
Based on the result, I looked at:
crm ra list ocf heartbeat
To find out what IPaddr2 parameters I needed, I used:
crm ra meta ocf:heartbeat:IPaddr2
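The same pattern works for the other resource classes used later on this page (a small addition to these notes; the commands just list whatever agents are installed):

crm ra list lsb
crm ra list stonith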

Configuration

This work was done in Sep-2010, with major revisions for stability in Aug-2011. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.

# The commands ultimately used to configure the high-availability (HA) servers:

# The beginning: make sure corosync is running on both hypatia and orestes:

/sbin/service corosync start

# The following line is needed because we have only two machines in 
# the HA cluster.

crm configure property no-quorum-policy=ignore

# We'll configure STONITH later (see below)

crm configure property stonith-enabled=false

# Define IP addresses to be managed by the HA systems.

crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=129.236.252.11 \
   cidr_netmask=32 op monitor interval=30s
crm configure primitive LocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.11 \
   cidr_netmask=32 op monitor interval=30s
crm configure primitive SandboxIP ocf:heartbeat:IPaddr2 params ip=10.43.7.11 \
   cidr_netmask=32 op monitor interval=30s
   
# Group these together, so they'll all be assigned to the same machine.
# The name of the group is "MainIPGroup".

crm configure group MainIPGroup ClusterIP LocalIP SandboxIP

# Let's continue by entering the crm utility for short sessions. I'm going to 
# test groups of commands before I commit them. (I omit the "configure show' 
# and "status" commands that I frequently typed in, in order to see that 
# everything was correct.)
   
# DRBD is a service that synchronizes the hard drives between two machines.
# For our cluster, one machine will have access to the "master" copy
# and make all the changes to that copy; the other machine will have the
# "slave" copy and mindlessly duplicate all the changes.

# I previously configured the DRBD resources 'admin' and 'work'. What the 
# following commands do is put the maintenance of these resources under
# the control of Pacemaker. 

crm
   # Define a "shadow" configuration, to test things without committing them
   # to the HA cluster:
   cib new drbd
   
   # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/
   
   configure primitive AdminDrbd ocf:linbit:drbd params drbd_resource=admin op monitor interval=60s
   
   # DRBD functions with a "master/slave" setup as described above. The following command
   # defines the name of the master disk partition ("Admin"). The remaining parameters
   # clarify that there are two copies, but only one can be the master, and
   # at most one can be a slave.
   
   configure master Admin AdminDrbd meta master-max=1 master-node-max=1 \
      clone-max=2 clone-node-max=1 notify=true globally-unique=false
      
   # The machine that gets the master copy (the one that will make changes to the drive)
   # should also be the one with the main IP address.
   
   configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master

   # We want to wait before assigning IPs to a node until we know that
   # Admin has been promoted to master on that node. 
   configure order AdminBeforeMainIP inf: Admin:promote MainIPGroup

   # I like these commands, so commit them to the running configuration.
   
   cib commit drbd
   
   # Things look good, so let's add another disk resource. I defined another drbd resource
   # with some spare disk space, called "work". The idea is that I can play with alternate 
   # virtual machines and save them on "work" before I copy them to the more robust "admin".
   
   configure primitive WorkDrbd ocf:linbit:drbd params drbd_resource=work op monitor interval=60s
   configure master Work WorkDrbd meta master-max=1 master-node-max=1 \
      clone-max=2 clone-node-max=1 notify=true globally-unique=false
      
   # I prefer the work directory to be on the main admin box, but it doesn't have to be. "500:" is 
   # weighting factor; compare it to "inf:" (for infinity) which is used in most of these commands. 
   
   configure colocation WorkPrefersMain 500: Work:Master MainIPGroup
      
   # Given a choice, try to put the Admin:Master on hypatia
   
   configure location DefinePreferredMainNode Admin 100: hypatia.nevis.columbia.edu

   cib commit drbd
   quit

# Now try a resource that depends on ordering: On the node that has the master
# resource for "work," mount that disk image as /work.
crm
   cib new workdisk
   
   # To find out that there was an "ocf:heartbeat:Filesystem" that I could use,
   # I used the command:
   ra classes
   
   # Based on the result, I looked at:
   
   ra list ocf heartbeat
   
   # To find out what Filesystem parameters I needed, I used:
   
   ra meta ocf:heartbeat:Filesystem
   
   # All of the above led me to create the following resource configuration:
   
   configure primitive WorkDirectory ocf:heartbeat:Filesystem \
      params device="/dev/drbd2" directory="/work" fstype="ext4"
      
   # Note that I had previously created an ext4 filesystem on /dev/drbd2.
   
   # Now specify that we want this to be on the same node as Work:Master:
   
   configure colocation DirectoryWithWork inf: WorkDirectory Work:Master
   
   # One more thing: It's important that we not try to mount the directory
   # until after Work has been promoted to master on the node.
   
   # A score of "inf" means "infinity"; if the DRBD resource 'work' can't
   # be set up, then don't mount the /work partition. 
   
   configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start
   
   cib commit workdisk
   quit

# We've made the relatively-unimportant DRBD resource 'work' function. Let's do it for 'admin'.
# Previously I created some LVM volumes on the admin DRBD master. We need to use a 
# resource to active them, but we can't activate them until after the Admin:Master
# is loaded.
crm
   cib new lvm
   
   # Activate the LVM volumes, but only after DRBD has figured out where
   # Admin:Master is located.
   
   configure primitive Lvm ocf:heartbeat:LVM \
      params volgrpname="admin"
   configure colocation LvmWithAdmin inf: Lvm Admin:Master
   configure order AdminBeforeLvm inf: Admin:promote Lvm:start
   
   cib commit lvm
   
   # Go back to the actual, live configuration
   
   cib use live
   
   # See if everything is working
   
   configure show 
   status
   
   # Go back to the shadow for more commands.
   
   cib use lvm
   
   # We have a whole bunch of filesystems on the "admin" volume group. Let's
   # create the commands to mount them.
   
   # The 'timeout="240s' piece is to give a four-minute interval to start
   # up the mount. This allows for a "it's been too long, do an fsck" check
   # on mounting the filesystem. 
   
   # We also allow five minutes for the unmounting to stop, just in case 
   # it's taking a while for some job on server to let go of the mount.
   # It's better that it take a while to switch over the system service
   # than for the mount to be forcibly terminated.
   
   configure primitive UsrDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/usr" directory="/usr/nevis" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
      
   configure primitive VarDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/var" directory="/var/nevis" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
      
   configure primitive MailDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/mail" directory="/mail" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
      
   configure primitive XenDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/xen" directory="/xen" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
      
   configure group AdminDirectoriesGroup UsrDirectory VarDirectory MailDirectory XenDirectory
   
   # We can't mount any of them until LVM is set up:
   
   configure colocation DirectoriesWithLVM inf: AdminDirectoriesGroup Lvm
   configure order LvmBeforeDirectories inf: Lvm AdminDirectoriesGroup

   cib commit lvm
   quit
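
# (A quick sanity check after the commit: on whichever node is running
# Admin:Master, something like the following should list the activated
# logical volumes and the mounted filesystems defined above.)
#
#    lvs admin
#    df -h /usr/nevis /var/nevis /mail /xen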

# Some standard Linux services are under corosync's control. They depend on some or
# all of the filesystems being mounted. 
   
# Let's start with a simple one: enable the printing service (cups):

crm
   cib new printing
   
   # lsb = "Linux Standard Base." It just means any service which is
   # controlled by the one of the standard scripts in /etc/init.d
   
   configure primitive Cups lsb:cups
   
   # Cups stores its spool files in /var/spool/cups. If the cups service 
   # were to switch to a different server, we want the new server to see 
   # the spooled files. So create /var/nevis/cups, link it with:
   #   mv /var/spool/cups /var/spool/cups.ori
   #   ln -sf /var/nevis/cups /var/spool/cups
   # and demand that the cups service only start if /var/nevis (and the other
   # high-availability directories) have been mounted.
   
   configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup
   
   # In order to prevent chaos, make sure that the high-availability directories
   # have been mounted before we try to start cups.
   
   configure order VarBeforeCups inf: AdminDirectoriesGroup Cups
   
   cib commit printing
   quit
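
# (A minimal sketch of that spool relocation, using the paths named above; the
# --reference options simply copy the ownership and permissions of the original
# spool directory onto the new one.)
#
#    mv /var/spool/cups /var/spool/cups.ori
#    mkdir -p /var/nevis/cups
#    chown --reference=/var/spool/cups.ori /var/nevis/cups
#    chmod --reference=/var/spool/cups.ori /var/nevis/cups
#    ln -sf /var/nevis/cups /var/spool/cups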

# The other services (xinetd, dhcpd) follow the same pattern as above:
# Make sure the services start on the same machine as the admin directories,
# and after the admin directories are successfully mounted.

crm
   cib new services
   
   configure primitive Xinetd lsb:xinetd
   configure primitive Dhcpd lsb:dhcpd
   
   configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup
   configure order VarBeforeXinetd inf: VarDirectory Xinetd
   
   configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup
   configure order VarBeforeDhcpd inf: VarDirectory Dhcpd
   
   cib commit services
   quit
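
# (Pacemaker can only manage an lsb: service sensibly if its init script is
# LSB-compliant, in particular returning the right exit codes from "status".
# A quick check on each script before handing it over, for example:)
#
#    /etc/init.d/xinetd status; echo $?
#    /etc/init.d/dhcpd status; echo $?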

# The high-availability servers export some of the admin directories to other
# systems, both real and virtual; for example, the /usr/nevis directory is 
# exported to all the other machines on the Nevis Linux cluster. 

# NFS exporting of a shared directory can be a little tricky. As with CUPS 
# spooling, we want to preserve the NFS export state in a way that the 
# backup server can pick it up. The safest way to do this is to create a 
# small separate LVM partition ("nfs") and mount it as "/var/lib/nfs",
# the NFS directory that contains files that keep track of the NFS state.

crm
   cib new nfs
   
   # Define the mount for the NFS state directory /var/lib/nfs
   
   configure primitive NfsStateDirectory ocf:heartbeat:Filesystem \
         params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4"
   configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
   configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory

   # Now that the NFS state directory is mounted, we can start nfslockd. Note
   # that we're starting nfslockd on both the primary and secondary HA systems;
   # by default a "clone" resource is started on all systems in a cluster. 

   # (Placing nfslockd under the control of Pacemaker turned out to be key to
   # successfully transferring cluster services to another node. The nfslockd and
   # nfs daemon information stored in /var/lib/nfs has to be consistent.)

   configure primitive NfsLockInstance lsb:nfslock
   configure clone NfsLock NfsLockInstance

   configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock

   # Once nfslockd has been set up, we can start NFS. (We colocate
   # NFS with 'NfsStateDirectory' instead of nfslockd, because nfslockd
   # is going to be started on both nodes.)
   
   configure primitive Nfs lsb:nfs
   configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory
   configure order NfsLockBeforeNfs inf: NfsLock Nfs 
   
   cib commit nfs
   quit
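
# (The export list itself still lives in the ordinary /etc/exports, which has
# to be identical on both nodes since either one may end up running NFS. A
# hypothetical entry for the shared software directory might look like
#
#    /usr/nevis   *.nevis.columbia.edu(rw,sync,no_root_squash)
#
# followed by "exportfs -ra"; the options actually used at Nevis may differ.)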

   
# The whole point of this entire setup is to be able to run guest virtual machines 
# under the control of the high-availability service. Here is the set-up for one example
# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.

# I duplicated the same procedure for franklin (mail server), ada (web server), and
# so on, but I don't show that here.

crm
   cib new hogwarts
   
   # Give the virtual machine a long stop timeout before flagging an error.
   # Sometimes it takes a while for Linux to shut down.
   
   configure primitive Hogwarts ocf:heartbeat:Xen params \
      xmfile="/xen/configs/Hogwarts.cfg" \
         op stop interval="0" timeout="240"

   # All the virtual machine files are stored in the /xen partition, which is one
   # of the high-availability admin directories. The virtual machine must run on
   # the system with this directory.

   configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup

   # All of the virtual machines depend on NFS-mounting directories which
   # are exported by the HA server. The safest thing to do is to make sure
   # NFS is running on the HA server before starting the virtual machine.
   
   configure order NfsBeforeHogwarts inf: Nfs Hogwarts
   
   cib commit hogwarts
   quit
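
# (Once this commits, the virtual machine can be checked with the usual tools;
# for example, the following should show pacemaker's view of the resource and
# Xen's view of the running guest, respectively.)
#
#    crm resource status Hogwarts
#    xm list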


# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed. 

# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.

# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.

# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting computer to turn off
# its power, and the machine won't come back up without manual intervention.)

# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.

# The official corosync distribution from <http://www.clusterlabs.org/> 
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links
# to this script from the stonith/external directory. 

# By the way, I sent the script to Cluster Labs, who accepted it.
# The next generation of their distribution will include the script.

# The following commands implement the STONITH mechanism for our cluster:

crm
   cib new stonith
   
   # The STONITH resource that can potentially shut down hypatia.
   
   configure primitive HypatiaStonith stonith:external/nut \
      params hostname="hypatia.nevis.columbia.edu" \
      ups="hypatia-ups" username="admin" password="acdc"
      
   # The node that runs the above script cannot be hypatia; it's
   # not wise to trust a node to STONITH itself. Note that the score
   # is "negative infinity," which means "never run this resource
   # on the named node."

   configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu

   # The STONITH resource that can potentially shut down orestes.

   configure primitive OrestesStonith stonith:external/nut \
      params hostname="orestes.nevis.columbia.edu" \
      ups="orestes-ups" username="admin" password="acdc"

   # Again, orestes cannot be the node that runs the above script.
   
   configure location OrestesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu
   
   cib commit stonith
   quit

# Now turn the STONITH mechanism on for the cluster.

crm configure property stonith-enabled=true
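
# (Before relying on this, it's worth confirming that the fencing devices are
# registered and that the UPSes accept commands from NUT; for example, something
# along the lines of:
#
#    stonith_admin --list-registered
#    upscmd -l hypatia-ups
#
# "stonith_admin" comes with pacemaker and "upscmd" with NUT; the exact
# invocations depend on the installed versions.)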


# At this point, the key elements of the high-availability configuration have
# been set up. There is one non-critical frill: One node (probably hypatia) will be 
# running the important services, while the other node (probably orestes) would
# otherwise be "twiddling its thumbs." Instead, let's have orestes do something useful: execute
# condor jobs.

# For orestes to do this, it requires the condor service. It also requires that
# library:/usr/nevis is mounted, the same as every other batch machine on the
# Nevis condor cluster. We can't use the automount daemon (amd) to do this for
# us, the way we do on the other batch nodes; we have to make corosync do the
# mounts.

crm
   cib new condor
   
   # Mount library:/usr/nevis. A bit of name confusion here: there's a /work
   # partition on the primary node, but the name 'LibraryOnWork' means that
   # the NFS mount of /usr/nevis is located on the secondary or "work" node.
      
   configure primitive LibraryOnWork ocf:heartbeat:Filesystem \
      params device="library:/usr/nevis" directory="/usr/nevis" \
      fstype="nfs"  
      
   # Corosync must not NFS-mount library:/usr/nevis on the system that has already 
   # mounted /usr/nevis directly as part of the AdminDirectoriesGroup
   # described above. 
   
   # Note that if there's only one node remaining in the high-availability 
   # cluster, it will be running the resource AdminDirectoriesGroup, and 
   # LibraryOnWork will never be started. This is fine; if there's only one
   # node left, I _don't_ want it running condor jobs.
   
   configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup

   # Only NFS-mount library:/usr/nevis _after_ the NFS export of /usr/nevis
   # has been set up.
   
   configure order NfsBeforeLibrary inf: Nfs LibraryOnWork

   # Define the IPs associated with the backup system, and group them together.
   # This is a non-critical definition, and I don't want to assign it until the more important
   # "secondary" resources have been set up. 

   configure primitive Burr ocf:heartbeat:IPaddr2 params ip=129.236.252.10 \
      cidr_netmask=32 op monitor interval=30s
   configure primitive BurrLocal ocf:heartbeat:IPaddr2 params ip=10.44.7.10 \
      cidr_netmask=32 op monitor interval=30s
   configure group AssistantIPGroup Burr BurrLocal

   configure colocation AssistantWithLibrary inf: AssistantIPGroup LibraryOnWork
   configure order LibraryBeforeAssistant inf: LibraryOnWork AssistantIPGroup

   # The standard condor execution service. As with all the batch nodes,
   # I've already configured /etc/condor/condor_config.local and created
   # scratch directories in /data/condor.
      
   configure primitive Condor lsb:condor

   # If we're able to mount library:/usr/nevis, then it's safe to start condor.
   # If we can't mount library:/usr/nevis, then condor will never be started.
   # (We stated above that AssistantIPGroup won't start until after LibraryOnWork.)
   
   configure colocation CondorWithAssistant inf: Condor AssistantIPGroup
   configure order AssistantBeforeCondor inf: AssistantIPGroup Condor
   
   cib commit condor
   quit
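
# (After this commits, the secondary node should eventually show up in the
# condor pool; "condor_status" run from any machine in the pool is an easy
# way to confirm it.)
#
#    condor_status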

META TOPICMOVED by="WilliamSeligman" date="1348092384" from="Nevis.CorosyncDualPrimaryConfiguration" to="Nevis.PacemakerDualPrimaryConfiguration"
 