---+!! Nevis particle-physics administrative cluster configuration

<div style="float:right; background-color:#EBEEF0; margin:0 0 20px 20px; padding: 0 10px 0 10px;">
%TOC{title="On this page:"}%
</div>

_Archived 20-Sep-2013_: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, yet many hours of downtime due to problems with the HA software. [[http://www.gossamer-threads.com/lists/linuxha/users/87132][This mailing-list post]] has some details.

This is a reference page. It describes how the high-availability [[http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/index.html][pacemaker]] configuration was set up on two administrative [[ListOfMachines][servers]], =hypatia= and =orestes=.

---++ Files

Key HA configuration files. _Note: Even in an emergency, there's no reason to edit these files!_

   * =[[http://pastebin.com/qRAxLpkx][/etc/cluster/cluster.conf]]=
   * =[[http://pastebin.com/tLyZd09i][/etc/lvm/lvm.conf]]=
   * =[[http://pastebin.com/H8Kfi2tM][/etc/drbd.d/global_common.conf]]=
   * =[[http://pastebin.com/1GWupJz8][/etc/drbd.d/admin.res]]=
   * =[[http://pastebin.com/ca1dRt6Y][/home/bin/fence_nut.pl]]=
   * =/etc/rc.d/rc.local=
   * =/home/bin/recover-symlinks.sh=
   * =/etc/rc.d/rc.fix-pacemaker-delay= (on hypatia only)

The links are to an external site, [[http://www.pastebin.com/][pastebin]]; I use it in case I want to consult with someone on the HA setup. If you're reading this from a hardcopy, you can find all these files by visiting [[http://pastebin.com/u/wgseligman]] and searching for =20130103=.

---++ One-time set-up

The commands to set up a dual-primary cluster are outlined here. The details can be found in [[http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/index.html][Clusters From Scratch]] and the [[https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial][Redhat Cluster Tutorial]].

*Warning: Do not type _any_ of these commands in the hopes of fixing a problem! They will erase the shared DRBD drive.*

---+++ DRBD set-up

Edit =[[http://pastebin.com/H8Kfi2tM][/etc/drbd.d/global_common.conf]]= and create =[[http://pastebin.com/1GWupJz8][/etc/drbd.d/admin.res]]=. Then on =hypatia=:
<verbatim>
/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin
</verbatim>
Then on =orestes=:
<verbatim>
/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin
</verbatim>
Back to =hypatia=:
<verbatim>
/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd
</verbatim>
Keep looking at the contents of =/proc/drbd=. It will take a while, but eventually the two disks will sync up.

Back to =orestes=:
<verbatim>
/sbin/drbdadm primary admin
cat /proc/drbd
</verbatim>
The result should be something like this:
<verbatim>
# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</verbatim>
[[http://www.drbd.org/users-guide/ch-admin.html#s-proc-drbd][Here's]] a guide to understanding the contents of =/proc/drbd=.
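If you'd rather not re-run =cat /proc/drbd= by hand while the initial sync runs, a small polling loop does the same job. This is only a sketch, not part of the original setup; it assumes the resource reports =ds:UpToDate/UpToDate= once the sync has finished:
<verbatim>
# Hypothetical convenience loop (not part of the original setup): print the
# DRBD status once a minute until both disks report UpToDate/UpToDate.
while ! grep -q 'ds:UpToDate/UpToDate' /proc/drbd; do
    date
    cat /proc/drbd
    sleep 60
done
echo "DRBD resync complete"
</verbatim>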
---+++ Clustered LVM setup

Most of the following commands only have to be issued on one of the nodes. See [[http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/index.html][Clusters From Scratch]] and the [[https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial][Redhat Cluster Tutorial]] for details.

   * Edit =[[http://pastebin.com/tLyZd09i][/etc/lvm/lvm.conf]]= on both systems; search this file for the initials =WGS= for a complete list of changes.
   * Change the filter line to search for DRBD partitions: <verbatim>filter = [ "a|/dev/drbd.*|", "a|/dev/md1|", "r|.*|" ]</verbatim>
   * For LVM locking: <verbatim>locking_type = 3</verbatim>
   * Edit =/etc/sysconfig/cman= to disable quorum (because there are only two nodes in the cluster): <verbatim>sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman</verbatim>
   * Create =[[http://pastebin.com/qRAxLpkx][/etc/cluster/cluster.conf]]= to define the two-node Pacemaker configuration, with fencing.
   * Create a physical volume and a clustered volume group on the DRBD partition. The name of the DRBD disk is =/dev/drbd0=; the name of the volume group is =ADMIN=. <verbatim>
pvcreate /dev/drbd0
vgcreate -c y ADMIN /dev/drbd0
</verbatim>
   * For each logical volume in the volume group, create the volume and install a GFS2 filesystem; for example, the following creates a logical volume =usr= within the volume group =ADMIN=: <verbatim>
lvcreate -L 200G -n usr ADMIN
mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
</verbatim> Note that =Nevis_HA= is the cluster name defined in =/etc/cluster/cluster.conf=.
   * Make sure the cman, clvmd, and pacemaker daemons will start at boot; on both nodes, do: <verbatim>
/sbin/chkconfig cman on
/sbin/chkconfig clvmd on
/sbin/chkconfig pacemaker on
</verbatim>
   * Reboot both nodes.

---++ Pacemaker configuration

---+++ Commands

The configuration has definitely changed from that listed below. To see the [[http://www.nevis.columbia.edu/cluster-status.html][current configuration]], run this as root on either =hypatia= or =orestes=:
<verbatim>
crm configure show
</verbatim>
To see the status of all the resources:
<verbatim>
crm resource status
</verbatim>
To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
<verbatim>
crm_mon
</verbatim>
You can run the above commands via sudo, but you'll have to extend your path; e.g.,
<verbatim>
export PATH=/sbin:/usr/sbin:${PATH}
sudo crm_mon
</verbatim>

---+++ Concepts

This may help as you work your way through the configuration:
<verbatim>
crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
    cidr_netmask=32 op monitor interval=30s timeout=60s

# Which is composed of:
* crm ::= "cluster resource manager", the command we're executing
* primitive ::= The type of resource object that we're creating
* MyIPResource ::= Our name for the resource
* IPaddr2 ::= The script to call
* ocf ::= The standard it conforms to
* ip=192.168.85.3 ::= Parameter(s) as name/value pairs
* cidr_netmask ::= netmask; 32 bits means use this exact IP address
* op ::= what follows are operations
* monitor interval=30s ::= check every 30 seconds that this resource is working
* timeout ::= how long to wait before you assume an "op" is dead.
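
# As a second, purely illustrative example (the resource name MyFSResource is
# hypothetical, not one defined on this cluster), the same anatomy applies to a
# filesystem resource of the kind configured further down this page:
crm configure primitive MyFSResource ocf:heartbeat:Filesystem \
    params device=/dev/ADMIN/usr directory=/usr/nevis fstype=gfs2 \
    op monitor interval=20s timeout=40s
# Here Filesystem is the script, ocf:heartbeat is the standard it conforms to,
# and the params give the block device, the mount point, and the filesystem type.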
</verbatim>
How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:
<verbatim>
crm ra classes
</verbatim>
Based on the result, I looked at:
<verbatim>
crm ra list ocf heartbeat
</verbatim>
To find out what IPaddr2 parameters I needed, I used:
<verbatim>
crm ra meta ocf:heartbeat:IPaddr2
</verbatim>

---+++ Initial configuration guide

This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: [[http://pastebin.com/QcxuvfK0]].
<verbatim>
# The commands ultimately used to configure the high-availability (HA) servers:

# The beginning: make sure pacemaker is running on both hypatia and orestes:
/sbin/service pacemaker status
crm node status
crm resource status

# We'll configure STONITH later (see below).
crm configure property stonith-enabled=false

# Let's continue by entering the crm utility for short sessions. I'm going to
# test groups of commands before I commit them. I omit the "crm configure show"
# and "crm status" commands that I frequently typed in, in order to see that
# everything was correct.

# I also omit some of the standard resource options
# (e.g., "... op monitor interval="20" timeout="40" depth="0" ...) to make the
# commands look simpler. You can see the complete list with "crm configure show".

# DRBD is a service that synchronizes the hard drives between two machines.
# When one machine makes any change to the DRBD disk, the other machine
# immediately duplicates that change on the block level. We have a dual-primary
# configuration, which means both machines can mount the DRBD disk at once.

# Start by entering the resource manager.
crm

# Define a "shadow" configuration, to test things without committing them
# to the HA cluster:
cib new drbd

# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res.
primitive AdminDrbd ocf:linbit:drbd \
    params drbd_resource="admin" \
    meta target-role="Master"

# The following resource defines how the DRBD resource (AdminDrbd) is to be
# duplicated ("cloned") among the nodes. The parameters clarify that there are
# two copies, one on each node, and both can be the master.
ms AdminClone AdminDrbd \
    meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" \
    notify="true" interleave="true"

configure show

# Looks good, so commit the change.
cib commit drbd
quit

# Now define resources that depend on ordering.
crm
cib new disk

# The DRBD disk is available to the system. The next step is to tell LVM
# that the volume group ADMIN exists on the disk.

# To find out that there was a resource "ocf:heartbeat:LVM" that I could use,
# I used the command:
ra classes
# Based on the result, I looked at:
ra list ocf heartbeat
# To find out what LVM parameters I needed, I used:
ra meta ocf:heartbeat:LVM

# All of the above led me to create the following resource configuration:
primitive AdminLvm ocf:heartbeat:LVM \
    params volgrpname="ADMIN"

# After I set up the volume group, I want to mount the logical volumes
# (partitions) within the volume group. Here's one of the partitions, /usr/nevis;
# note that I begin all the filesystem resources with FS so they'll be next
# to each other when I type "crm configure show".
primitive FSUsrNevis ocf:heartbeat:Filesystem \
    params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" \
    options="defaults,noatime,nodiratime"

# I have similar definitions for the other logical volumes in volume group ADMIN:
# /mail, /var/nevis, etc.

# Now I'm going to define a resource group. The following command means:
# - Put all these resources on the same node;
# - Start these resources in the order they're listed;
# - The resources depend on each other in the order they're listed. For example,
#   if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running.
group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork

# We want these logical volumes (or partitions or filesystems) to be available
# on both nodes. To do this, we define a clone resource.
clone FilesystemClone FilesystemGroup meta interleave="true"

# It's important that we not try to set up the filesystems
# until the DRBD admin resource is running on a node, and has been
# promoted to master.
# A score of "inf:" means "infinity": 'FilesystemClone' must be on a node on which
# 'AdminClone' is in the Master state; if the DRBD resource 'AdminClone' can't
# be promoted, then don't start the 'FilesystemClone' resource. (You can use numeric
# values instead of infinity, in which case these constraints become suggestions
# instead of being mandatory.)
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

cib commit disk
quit

# Once all the filesystems are mounted, we can start other resources. Let's
# define a set of cloned IP addresses that will always point to at least one of
# the nodes, possibly both.
crm
cib new ip

# One address for each network.
primitive IP_cluster ocf:heartbeat:IPaddr2 \
    params ip="129.236.252.11" cidr_netmask="32" nic="eth0"
primitive IP_cluster_local ocf:heartbeat:IPaddr2 \
    params ip="10.44.7.11" cidr_netmask="32" nic="eth2"
primitive IP_cluster_sandbox ocf:heartbeat:IPaddr2 \
    params ip="10.43.7.11" cidr_netmask="32" nic="eth0.3"

# Group them together.
group IPGroup IP_cluster IP_cluster_local IP_cluster_sandbox

# The option "globally-unique=true" works with IPTABLES to make
# sure that ethernet connections are not disrupted even if one of the
# nodes goes down; see "Clusters From Scratch" for details.
clone IPClone IPGroup \
    meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false"

# Make sure the filesystems are mounted before starting the IP resources.
colocation IP_With_Filesystem inf: IPClone FilesystemClone
order Filesystem_Before_IP inf: FilesystemClone IPClone

cib commit ip
quit

# We have to export some of the filesystems via NFS before some of the virtual
# machines will be able to run.
crm
cib new exports

# This is an example NFS export resource; I won't list them all here. See
# "crm configure show" for the complete list.
primitive ExportUsrNevis ocf:heartbeat:exportfs \
    description="Site-wide applications installed in /usr/nevis" \
    params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" \
    options="ro,no_root_squash,async"

# Define a group for all the exportfs resources. You can see it's a long list,
# which is why I don't list them all explicitly. I had to be careful
# about the exportfs definitions; despite the locking mechanisms of GFS2,
# we'd get into trouble if two external systems tried to write to the same
# DRBD partition at once via NFS.
group ExportsGroup ExportMail ExportMailInbox ExportMailFolders ExportMailForward ExportMailProcmailrc \
    ExportUsrNevisHermes ExportUsrNevis ExportUsrNevisOffsite ExportWWW

# Clone the group so both nodes export the partitions. Make sure the
# filesystems are mounted before we export them.
clone ExportsClone ExportsGroup
colocation Exports_With_Filesystem inf: ExportsClone FilesystemClone
order Filesystem_Before_Exports inf: FilesystemClone ExportsClone

cib commit exports
quit

# Symlinks: There are some scripts that I want to run under cron. These scripts are
# located in the DRBD /var/nevis file system. For them to run via cron, they have to
# be found in /etc/cron.d somehow. A symlink is the easiest way, and there's a
# symlink pacemaker resource to manage this.
crm
cib new cron

# The ambient-temperature script periodically checks the computer room's
# environment monitor, and shuts down the cluster if the temperature gets
# too high.
primitive CronAmbientTemperature ocf:heartbeat:symlink \
    description="Shutdown cluster if A/C stops" \
    params link="/etc/cron.d/ambient-temperature" target="/var/nevis/etc/cron.d/ambient-temperature" \
    backup_suffix=".original"

# We don't want to clone this resource; I only want one system to run this script
# at any one time.
colocation Temperature_With_Filesystem inf: CronAmbientTemperature FilesystemClone
order Filesystem_Before_Temperature inf: FilesystemClone CronAmbientTemperature

# Every couple of months, make a backup of the virtual machines' disk images.
primitive CronBackupVirtualDiskImages ocf:heartbeat:symlink \
    description="Periodically save copies of the virtual machines" \
    params link="/etc/cron.d/backup-virtual-disk-images" \
    target="/var/nevis/etc/cron.d/backup-virtual-disk-images" \
    backup_suffix=".original"
colocation BackupImages_With_Filesystem inf: CronBackupVirtualDiskImages FilesystemClone
order Filesystem_Before_BackupImages inf: FilesystemClone CronBackupVirtualDiskImages

cib commit cron
quit

# These are the most important resources on the HA cluster: the virtual
# machines.
crm
cib new vm

# In order to start a virtual machine, the libvirtd daemon has to run. The "lsb:" means
# "Linux Standard Base", which in turn means any script located in
# /etc/init.d.
primitive Libvirtd lsb:libvirtd

# libvirtd looks for configuration files that define the virtual machines.
# These files are kept in /var/nevis, like the above cron scripts, and are
# "placed" via symlinks.
primitive SymlinkEtcLibvirt ocf:heartbeat:symlink \
    params link="/etc/libvirt" target="/var/nevis/etc/libvirt" backup_suffix=".original"
primitive SymlinkQemuSnapshot ocf:heartbeat:symlink \
    params link="/var/lib/libvirt/qemu/snapshot" target="/var/nevis/lib/libvirt/qemu/snapshot" \
    backup_suffix=".original"

# Again, define a group for these resources, clone the group so they
# run on both nodes, and make sure they don't run unless the
# filesystems are mounted.
group LibvirtdGroup SymlinkEtcLibvirt SymlinkQemuSnapshot Libvirtd
clone LibvirtdClone LibvirtdGroup
colocation Libvirtd_With_Filesystem inf: LibvirtdClone FilesystemClone

# A tweak: some virtual machines require the directories exported
# by the exportfs resources defined above. Don't start the VMs until
# the exports are complete.
order Exports_Before_Libvirtd inf: ExportsClone LibvirtdClone

# The typical definition of a resource that runs a VM. I won't list
# them all, just the one for the mail server.
# Note that all the virtual-machine resource names start with VM_, so they'll
# show up next to each other in the output of "crm configure show".

# VM migration is a neat feature. If pacemaker has the chance to move
# a virtual machine, it can transmit it to another node without stopping it
# on the source node and restarting it at the destination. If a machine
# crashes, migration can't happen, but it can greatly speed up the
# controlled shutdown of a node.
primitive VM_franklin ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/franklin.xml" \
    migration_transport="ssh" meta allow-migrate="true"

# We don't want to clone the VMs; it would just confuse things if there were
# two mail servers (with the same IP address!) running at the same time.
colocation Mail_With_Libvirtd inf: VM_franklin LibvirtdClone
order Libvirtd_Before_Mail inf: LibvirtdClone VM_franklin

cib commit vm
quit

# A less-critical resource is tftp. As above, we define the basic xinetd
# resource found in /etc/init.d, include a configuration file via a symlink,
# then clone the resource and specify it can't run until the filesystems
# are mounted.
crm
cib new tftp

primitive Xinetd lsb:xinetd
primitive SymlinkTftp ocf:heartbeat:symlink \
    params link="/etc/xinetd.d/tftp" target="/var/nevis/etc/xinetd.d/tftp" \
    backup_suffix=".original"

group TftpGroup SymlinkTftp Xinetd
clone TftpClone TftpGroup
colocation Tftp_With_Filesystem inf: TftpClone FilesystemClone
order Filesystem_Before_Tftp inf: FilesystemClone TftpClone

cib commit tftp
quit

# More important is dhcpd, which assigns IP addresses dynamically.
# Many systems at Nevis require a DHCP server for their IP address,
# including the wireless routers. This follows the same pattern as above,
# except that we don't clone the dhcpd daemon, since we want only
# one DHCP server at Nevis.
crm
cib new dhcp
configure

primitive Dhcpd lsb:dhcpd

# Associate an IP address with the DHCP server. This is a mild
# convenience for the times I update the list of MAC addresses
# to be assigned permanent IP addresses.
primitive IP_dhcp ocf:heartbeat:IPaddr2 \
    params ip="10.44.107.11" cidr_netmask="32" nic="eth2"

primitive SymlinkDhcpdConf ocf:heartbeat:symlink \
    params link="/etc/dhcp/dhcpd.conf" target="/var/nevis/etc/dhcpd.conf" \
    backup_suffix=".original"
primitive SymlinkDhcpdLeases ocf:heartbeat:symlink \
    params link="/var/lib/dhcpd" target="/var/nevis/dhcpd" \
    backup_suffix=".original"
primitive SymlinkSysconfigDhcpd ocf:heartbeat:symlink \
    params link="/etc/sysconfig/dhcpd" target="/var/nevis/etc/sysconfig/dhcpd" \
    backup_suffix=".original"

group DhcpGroup SymlinkDhcpdConf SymlinkSysconfigDhcpd SymlinkDhcpdLeases Dhcpd IP_dhcp
colocation Dhcp_With_Filesystem inf: DhcpGroup FilesystemClone
order Filesystem_Before_Dhcp inf: FilesystemClone DhcpGroup

cib commit dhcp
quit

# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason.
# The other node will take over as needed.

# Suppose the failed node tries to come up again. This can be a problem: the other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.

# The STONITH mechanism means: if a node fails, the remaining node(s) in the cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.

# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power; it will never come back up.)

# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.

# The official corosync distribution from <http://www.clusterlabs.org/>
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/fence_nut.pl on both hypatia and orestes; there are appropriate links
# to this script from /usr/sbin/fence_nut.

# The following commands implement the STONITH mechanism for our cluster:
crm
cib new stonith

# The STONITH resource that can potentially shut down hypatia.
primitive StonithHypatia stonith:fence_nut \
    params stonith-timeout="120s" pcmk_host_check="static-list" \
    pcmk_host_list="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="XXXX" \
    password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
    noverifyonoff="1" debug="1"

# The node that runs the above script cannot be hypatia; it's
# not wise to trust a node to STONITH itself. Note that the score
# is "negative infinity," which means "never run this resource
# on the named node."
location StonithHypatia_Location StonithHypatia -inf: hypatia.nevis.columbia.edu

# The STONITH resource that can potentially shut down orestes.
primitive StonithOrestes stonith:fence_nut \
    params stonith-timeout="120s" pcmk_host_check="static-list" \
    pcmk_host_list="orestes.nevis.columbia.edu" ups="orestes-ups" username="XXXX" \
    password="XXXX" cycledelay="20" ondelay="20" offdelay="20" \
    noverifyonoff="1" debug="1"

# Again, orestes cannot be the node that runs the above script.
location StonithOrestes_Location StonithOrestes -inf: orestes.nevis.columbia.edu

cib commit stonith
quit

# Now turn the STONITH mechanism on for the cluster.
crm configure property stonith-enabled=true
</verbatim>
Again, the final configuration that results from the above commands is on [[http://pastebin.com/fRJMAYa6][pastebin]].
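After committing a batch of changes like the ones above, a quick sanity check is worthwhile. This is only a sketch, not part of the original configuration session; run it as root on either node:
<verbatim>
# Check the live cluster configuration for errors and warnings.
crm_verify -L -V

# Take a one-shot snapshot of the resource status (the non-interactive
# equivalent of crm_mon).
crm_mon -1
</verbatim>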