Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 6 to 6 | ||||||||
Added: | ||||||||
> > | Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, yet many hours of downtime due to problems with the HA software itself. See the mailing-list post linked here for more details. |
This is a reference page. It contains a text file that describes how the high-availability pacemaker cluster is configured on hypatia and orestes .
Files |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 151 to 151 | ||||||||
Initial configuration guide | ||||||||
Changed: | ||||||||
< < | This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/fRJMAYa6 |
> > | This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/QcxuvfK0 |
# The commands ultimately used to configure the high-availability (HA) servers: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 98 to 98 | ||||||||
Commands | ||||||||
Changed: | ||||||||
< < | The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes : | |||||||
> > | The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes : |
crm configure show |
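Since the live configuration drifts from what's documented here, it can be handy to snapshot it to a dated file before making changes. A minimal sketch; the output path is just an example:

crm configure show > /root/crm-config-backup-$(date +%Y%m%d).txt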
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 78 to 78 | ||||||||
| ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
| ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
vgcreate -c y ADMIN /dev/drbd0 | ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
Note that Nevis_HA is the cluster name defined in /etc/cluster/cluster.conf .
|
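As a quick sanity check before running mkfs.gfs2, you can confirm that the cluster name passed to -t matches the one in /etc/cluster/cluster.conf. A minimal sketch, assuming the cman tools that ship with this cluster stack are installed:

grep 'cluster name' /etc/cluster/cluster.conf
cman_tool status | grep -i 'cluster name'
# Both should report Nevis_HA; the "-t Nevis_HA:usr" argument must use that exact name.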
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 151 to 151 | ||||||||
Initial configuration guide | ||||||||
Changed: | ||||||||
< < | This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. | |||||||
> > | This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. The entire final configuration is on pastebin: http://pastebin.com/fRJMAYa6 |
# The commands ultimately used to configure the high-availability (HA) servers: | ||||||||
Line: 562 to 562 | ||||||||
Added: | ||||||||
> > | Again, the final configuration that results from the above commands is on pastebin. |
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 121 to 121 | ||||||||
This may help as you work your way through the configuration:
crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \ | ||||||||
Changed: | ||||||||
< < | cidr_netmask=32 op monitor interval=30s | |||||||
> > | cidr_netmask=32 op monitor interval=30s timeout=60s | |||||||
# Which is composed of * crm ::= "cluster resource manager", the command we're executing | ||||||||
Line: 133 to 133 | ||||||||
* cidr_netmask ::= netmask; 32-bits means use this exact IP address * op ::== what follows are options * monitor interval=30s ::= check every 30 seconds that this resource is working | ||||||||
Changed: | ||||||||
< < | # ... timeout = how to long wait before you assume a resource is dead. | |||||||
> > | * timeout ::= how long to wait before you assume an "op" is dead. | |||||||
How to find out which scripts exist, that is, which resources can be controlled by the HA cluster: | ||||||||
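For reference, these are the discovery commands used throughout this page (run them as root, or drop the leading crm when already inside a crm shell session):

crm ra classes                      # list the available resource classes (ocf, lsb, stonith, ...)
crm ra list ocf heartbeat           # list the OCF resource agents from the heartbeat provider
crm ra meta ocf:heartbeat:IPaddr2   # show the parameters a particular agent accepts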
Line: 172 to 171 | ||||||||
# and "crm status" commands that I frequently typed in, in order to see that # everything was correct. | ||||||||
Changed: | ||||||||
< < | # I also omit the standard resource options | |||||||
> > | # I also omit some of the standard resource options | |||||||
# (e.g., "... op monitor interval="20" timeout="40" depth="0"...) to make the | ||||||||
Changed: | ||||||||
< < | # commands look simpler. This particular option means to check that the # resource is running every 20 seconds, and to declare that the monitor operation # will generate an error if 40 seconds elapse without a response. You can see the | |||||||
> > | # commands look simpler. You can see the | |||||||
# complete list with "crm configure show". # DRBD is a service that synchronizes the hard drives between two machines. | ||||||||
Line: 216 to 213 | ||||||||
crm cib new disk | ||||||||
Changed: | ||||||||
< < | # The DRBD is available to the system. The next step is to tell LVM | |||||||
> > | # The DRBD disk is available to the system. The next step is to tell LVM | |||||||
# that the volume group ADMIN exists on the disk. # To find out that there was a resource "ocf:heartbeat:LVM" that I could use, | ||||||||
Line: 261 to 258 | ||||||||
clone FilesystemClone FilesystemGroup meta interleave="true" | ||||||||
Changed: | ||||||||
< < | # One more thing: It's important that we not try to set up the filesystems | |||||||
> > | # It's important that we not try to set up the filesystems | |||||||
# until the DRBD admin resource is running on a node, and has been # promoted to master. | ||||||||
Changed: | ||||||||
< < | # A score of "inf" means "infinity"; if the DRBD resource 'AdminClone' can't # be promoted, then don't start the 'FilesystemClone' resource. | |||||||
> > | # A score of "inf:" means "infinity": 'FileSystemClone' must be on a node on which # 'AdminClone' is in the Master state; if the DRBD resource 'AdminClone' can't # be promoted, then don't start the 'FilesystemClone' resource. (You can use numeric # values instead of infinity, in which case these constraints become suggestions # instead of being mandatory.) | |||||||
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start | ||||||||
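As a sketch of the non-mandatory form mentioned above (the constraint name below is purely illustrative and is not part of the actual configuration), a finite score expresses a preference instead of a requirement; colocation and order constraints accept numeric scores the same way:

# Prefer, but do not require, that this resource run on hypatia:
location Prefer_Hypatia FilesystemClone 100: hypatia.nevis.columbia.edu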
Line: 274 to 274 | ||||||||
cib commit disk quit | ||||||||
Changed: | ||||||||
< < | # Some standard Linux services are under corosync's control. They depend on some or # all of the filesystems being mounted. # Let's start with a simple one: enable the printing service (cups): | |||||||
> > | # Once all the filesystems are mounted, we can start other resources. Let's # define a set of cloned IP addresses that will always point to at least one of the nodes, # possibly both. |
crm | ||||||||
Changed: | ||||||||
< < | cib new printing | |||||||
> > | cib new ip # One address for each network | |||||||
Changed: | ||||||||
< < | # lsb = "Linux Standard Base." It just means any service which is # controlled by the one of the standard scripts in /etc/init.d | |||||||
> > | primitive IP_cluster ocf:heartbeat:IPaddr2 params ip="129.236.252.11" cidr_netmask="32" nic="eth0" primitive IP_cluster_local ocf:heartbeat:IPaddr2 params ip="10.44.7.11" cidr_netmask="32" nic="eth2" primitive IP_cluster_sandbox ocf:heartbeat:IPaddr2 params ip="10.43.7.11" cidr_netmask="32" nic="eth0.3" | |||||||
Changed: | ||||||||
< < | configure primitive Cups lsb:cups | |||||||
> > | # Group them together | |||||||
Changed: | ||||||||
< < | # Cups stores its spool files in /var/spool/cups. If the cups service # were to switch to a different server, we want the new server to see # the spooled files. So create /var/nevis/cups, link it with: # mv /var/spool/cups /var/spool/cups.ori # ln -sf /var/nevis/cups /var/spool/cups # and demand that the cups service only start if /var/nevis (and the other # high-availability directories) have been mounted. | |||||||
> > | group IPGroup IP_cluster IP_cluster_local IP_cluster_sandbox | |||||||
Changed: | ||||||||
< < | configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup | |||||||
> > | # The option "globally-unique=true" works with IPTABLES to make # sure that ethernet connections are not disrupted even if one of # nodes goes down; see "Clusters From Scratch" for details. | |||||||
Changed: | ||||||||
< < | # In order to prevent chaos, make sure that the high-availability directories # have been mounted before we try to start cups. | |||||||
> > | clone IPClone IPGroup meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false" | |||||||
Changed: | ||||||||
< < | configure order VarBeforeCups inf: AdminDirectoriesGroup Cups | |||||||
> > | # Make sure the filesystems are mounted before starting the IP resources. colocation IP_With_Filesystem inf: IPClone FilesystemClone order Filesystem_Before_IP inf: FilesystemClone IPClone | |||||||
Changed: | ||||||||
< < | cib commit printing | |||||||
> > | cib commit ip | |||||||
quit | ||||||||
Changed: | ||||||||
< < | # The other services (xinetd, dhcpd) follow the same pattern as above: # Make sure the services start on the same machine as the admin directories, # and after the admin directories are successfully mounted. | |||||||
> > | # We have to export some of the filesystems via NFS before some of the virtual machines # will be able to run. | |||||||
crm | ||||||||
Changed: | ||||||||
< < | cib new services | |||||||
> > | cib new exports | |||||||
Changed: | ||||||||
< < | configure primitive Xinetd lsb:xinetd configure primitive Dhcpd lsb:dhcpd | |||||||
> > | # This is an example NFS export resource; I won't list them all here. See # "crm configure show" for the complete list. primitive ExportUsrNevis ocf:heartbeat:exportfs description="Site-wide applications installed in /usr/nevis" params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" options="ro,no_root_squash,async" # Define a group for all the exportfs resources. You can see it's a long list, # which is why I don't list them all explicitly. I had to be careful # about the exportfs definitions; despite the locking mechanisms of GFS2, # we'd get into trouble if two external systems tried to write to the same # DRBD partition at once via NFS. |
Changed: | ||||||||
< < | configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup configure order VarBeforeXinetd inf: VarDirectory Xinetd | |||||||
> > | group ExportsGroup ExportMail ExportMailInbox ExportMailFolders ExportMailForward ExportMailProcmailrc ExportUsrNevisHermes ExportUsrNevis ExportUsrNevisOffsite ExportWWW | |||||||
Changed: | ||||||||
< < | configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup configure order VarBeforeDhcpd inf: VarDirectory Dhcpd | |||||||
> > | # Clone the group so both nodes export the partitions. Make sure the # filesystems are mounted before we export them. | |||||||
Changed: | ||||||||
< < | cib commit services | |||||||
> > | clone ExportsClone ExportsGroup colocation Exports_With_Filesystem inf: ExportsClone FilesystemClone order Filesystem_Before_Exports inf: FilesystemClone ExportsClone cib commit exports | |||||||
quit | ||||||||
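Once the exports clone is running, a quick way to confirm what each node is actually exporting (standard NFS tooling, nothing cluster-specific) is:

exportfs -v    # run on each node; the exportfs resources defined above should appear with their options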
Changed: | ||||||||
< < | # The high-availability servers export some of the admin directories to other # systems, both real and virtual; for example, the /usr/nevis directory is # exported to all the other machines on the Nevis Linux cluster. # NFS exporting of a shared directory can be a little tricky. As with CUPS # spooling, we want to preserve the NFS export state in a way that the # backup server can pick it up. The safest way to do this is to create a # small separate LVM partition ("nfs") and mount it as "/var/lib/nfs", # the NFS directory that contains files that keep track of the NFS state. | |||||||
> > | # Symlinks: There are some scripts that I want to run under cron. These scripts are # located in the DRBD /var/nevis file system. For them to run via cron, they have to # be found in /etc/cron.d somehow. A symlink is the easiest way, and there's a # symlink pacemaker resource to manage this. |
crm | ||||||||
Changed: | ||||||||
< < | cib new nfs | |||||||
> > | cib new cron | |||||||
Changed: | ||||||||
< < | # Define the mount for the NFS state directory /var/lib/nfs | |||||||
> > | # The ambient-temperature script periodically checks the computer room's # environment monitor, and shuts down the cluster if the temperature gets # too high. primitive CronAmbientTemperature ocf:heartbeat:symlink description="Shutdown cluster if A/C stops" params link="/etc/cron.d/ambient-temperature" target="/var/nevis/etc/cron.d/ambient-temperature" backup_suffix=".original" # We don't want to clone this resource; I only want one system to run this script # at any one time. colocation Temperature_With_Filesystem inf: CronAmbientTemperature FilesystemClone order Filesystem_Before_Temperature inf: FilesystemClone CronAmbientTemperature # Every couple of months, make a backup of the virtual machine's disk images. primitive CronBackupVirtualDiskImages ocf:heartbeat:symlink description="Periodically save copies of the virtual machines" params link="/etc/cron.d/backup-virtual-disk-images" target="/var/nevis/etc/cron.d/backup-virtual-disk-images" backup_suffix=".original" colocation BackupImages_With_Filesystem inf: CronBackupVirtualDiskImages FilesystemClone order Filesystem_Before_BackupImages inf: FilesystemClone CronBackupVirtualDiskImages | |||||||
Changed: | ||||||||
< < | configure primitive NfsStateDirectory ocf:heartbeat:Filesystem params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4" configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory | |||||||
> > | cib commit cron quit | |||||||
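To confirm a symlink resource has done its work on the node that's running it (paths taken from the example above):

ls -l /etc/cron.d/ambient-temperature
# It should point at /var/nevis/etc/cron.d/ambient-temperature; any pre-existing file
# is preserved with the ".original" suffix given by backup_suffix.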
Changed: | ||||||||
< < | # Now that the NFS state directory is mounted, we can start nfslockd. Note that # that we're starting NFS lock on both the primary and secondary HA systems; # by default a "clone" resource is started on all systems in a cluster. | |||||||
> > | # These are the most important resources on the HA cluster: the virtual # machines. | |||||||
Changed: | ||||||||
< < | # (Placing nfslockd under the control of Pacemaker turned out to be key to # successful transfer of cluster services to another node. The nfslockd and # nfs daemon information stored in /var/lib/nfs have to be consistent.) | |||||||
> > | crm cib new vm | |||||||
Changed: | ||||||||
< < | configure primitive NfsLockInstance lsb:nfslock configure clone NfsLock NfsLockInstance | |||||||
> > | # In order to start a virtual machine, the libvirtd daemon has to run. The "lsb:" means # "Linux Standard Base", which in turn means any script located in # /etc/init.d. | |||||||
Changed: | ||||||||
< < | configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock | |||||||
> > | primitive Libvirtd lsb:libvirtd | |||||||
Changed: | ||||||||
< < | # Once nfslockd has been set up, we can start NFS. (We say to colocate # NFS with 'NfsStateDirectory', instead of nfslockd, because nfslockd # is going to be started on both nodes.) | |||||||
> > | # libvirtd looks for configuration files that define the virtual machines. # These files are kept in /var/nevis, like the above cron scripts, and are # "placed" via symlinks. | |||||||
Changed: | ||||||||
< < | configure primitive Nfs lsb:nfs configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory configure order NfsLockBeforeNfs inf: NfsLock Nfs | |||||||
> > | primitive SymlinkEtcLibvirt ocf:heartbeat:symlink params link="/etc/libvirt" target="/var/nevis/etc/libvirt" backup_suffix=".original" primitive SymlinkQemuSnapshot ocf:heartbeat:symlink params link="/var/lib/libvirt/qemu/snapshot" target="/var/nevis/lib/libvirt/qemu/snapshot" backup_suffix=".original" | |||||||
Changed: | ||||||||
< < | cib commit nfs quit | |||||||
> > | # Again, define a group for these resources, clone the group so they # run on both nodes, and make sure they don't run unless the # filesystems are mounted. | |||||||
Added: | ||||||||
> > | group LibvirtdGroup SymlinkEtcLibvirt SymlinkQemuSnapshot Libvirtd clone LibvirtdClone LibvirtdGroup colocation Libvirtd_With_Filesystem inf: LibvirtdClone FilesystemClone | |||||||
Changed: | ||||||||
< < | # The whole point of the entire setup is to be able to run guest virtual machines # under the control of the high-availability service. Here is the set-up for one example # virtual machine. I previously created the hogwarts virtual machine and copied its # configuration to /xen/configs/hogwarts.cfg. | |||||||
> > | # A tweak: some virtual machines require the directories exported # by the exportfs resources defined above. Don't start the VMs until # the exports are complete. |
Changed: | ||||||||
< < | # I duplicated the same procedure for franklin (mail server), ada (web server), and # so on, but I don't show that here. | |||||||
> > | order Exports_Before_Libvirtd inf: ExportsClone LibvirtdClone | |||||||
Changed: | ||||||||
< < | crm cib new hogwarts | |||||||
> > | # The typical definition of a resource that runs a VM. I won't list # them all, just the one for the mail server. Note that all the # virtual-machine resource names start with VM_, so they'll show # up next to each other in the output of "crm configure show". | |||||||
Changed: | ||||||||
< < | # Give the virtual machine a long stop interval before flagging an error. # Sometimes it takes a while for Linux to shut down. | |||||||
> > | # VM migration is a neat feature. If pacemaker has the chance to move # a virtual machine, it can transmit it to another node without stopping it # on the source node and restarting it at the destination. If a machine # crashes, migration can't happen, but it can greatly speed up the # controlled shutdown of a node. |
Changed: | ||||||||
< < | configure primitive Hogwarts ocf:heartbeat:Xen params xmfile="/xen/configs/Hogwarts.cfg" op stop interval="0" timeout="240" | |||||||
> > | primitive VM_franklin ocf:heartbeat:VirtualDomain params config="/etc/libvirt/qemu/franklin.xml" \ migration_transport="ssh" meta allow-migrate="true" | |||||||
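If you ever need to push a running VM to the other node by hand, the crm shell's resource commands can do it. A hedged sketch; the target node here is just an example, and the temporary constraint that migrate creates should be removed afterwards:

crm resource migrate VM_franklin orestes.nevis.columbia.edu
# ... later, clear the location constraint that migrate added:
crm resource unmigrate VM_franklin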
Changed: | ||||||||
< < | # All the virtual machine files are stored in the /xen partition, which is one # of the high-availability admin directories. The virtual machine must run on # the system with this directory. | |||||||
> > | # We don't want to clone the VMs; it will just confuse things if there are # two mail servers (with the same IP address!) running at the same time. |
Changed: | ||||||||
< < | configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup | |||||||
> > | colocation Mail_With_Libvirtd inf: VM_franklin LibvirtdClone order Libvirtd_Before_Mail inf: LibvirtdClone VM_franklin | |||||||
Changed: | ||||||||
< < | # All of the virtual machines depend on NFS-mounting directories which # are exported by the HA server. The safest thing to do is to make sure # NFS is running on the HA server before starting the virtual machine. | |||||||
> > | cib commit vm quit # A less-critical resource is tftp. As above, we define the basic xinetd # resource found in /etc/init.d, include a configuration file via a symlink, # then clone the resource and specify that it can't run until the filesystems # are mounted. crm cib new tftp |
Changed: | ||||||||
< < | configure order NfsBeforeHogwarts inf: Nfs Hogwarts | |||||||
> > | primitive Xinetd lsb:xinetd primitive SymlinkTftp ocf:heartbeat:symlink params link="/etc/xinetd.d/tftp" target="/var/nevis/etc/xinetd.d/tftp" backup_suffix=".original" group TftpGroup SymlinkTftp Xinetd clone TftpClone TftpGroup colocation Tftp_With_Filesystem inf: TftpClone FilesystemClone order Filesystem_Before_Tftp inf: FilesystemClone TftpClone | |||||||
Changed: | ||||||||
< < | cib commit hogwarts | |||||||
> > | cib commit tftp | |||||||
quit | ||||||||
Added: | ||||||||
> > | # More important is dhcpd, which assigns IP addresses dynamically. # Many systems at Nevis require a DHCP server for their IP address, # including the wireless routers. This follows the same pattern as above, # except that we don't clone the dhcpd daemon, since we want only # one DHCP server at Nevis. crm cib new dhcp configure primitive Dhcpd lsb:dhcpd # Associate an IP address with the DHCP server. This is a mild # convenience for the times I update the list of MAC addresses # to be assigned permanent IP addresses. primitive IP_dhcp ocf:heartbeat:IPaddr2 params ip="10.44.107.11" cidr_netmask="32" nic="eth2" primitive SymlinkDhcpdConf ocf:heartbeat:symlink params link="/etc/dhcp/dhcpd.conf" target="/var/nevis/etc/dhcpd.conf" backup_suffix=".original" primitive SymlinkDhcpdLeases ocf:heartbeat:symlink params link="/var/lib/dhcpd" target="/var/nevis/dhcpd" backup_suffix=".original" primitive SymlinkSysconfigDhcpd ocf:heartbeat:symlink params link="/etc/sysconfig/dhcpd" target="/var/nevis/etc/sysconfig/dhcpd" backup_suffix=".original" group DhcpGroup SymlinkDhcpdConf SymlinkSysconfigDhcpd SymlinkDhcpdLeases Dhcpd IP_dhcp colocation Dhcp_With_Filesystem inf: DhcpGroup FilesystemClone order Filesystem_Before_Dhcp inf: FilesystemClone DhcpGroup cib commit dhcp quit |
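After committing each of the shadow CIBs above, it's worth checking that everything actually started where you expect. These are the same status tools mentioned earlier on this page:

crm_mon -1            # one-shot snapshot of nodes and resources
crm resource status   # list each resource and where it is running
crm configure show    # review the configuration that is now live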
# An important part of a high-availability configuration is STONITH = "Shoot the # other node in the head." Here's the idea: suppose one node fails for some reason. The | ||||||||
Line: 429 to 518 | ||||||||
# The official corosync distribution from <http://www.clusterlabs.org/> # does not include a script for NUT, so I had to write one. It's located at | ||||||||
Changed: | ||||||||
< < | # /home/bin/nut.sh on both hypatia and orestes; there are appropriate links # to this script from the stonith/external directory. | |||||||
> > | # /home/bin/fence_nut.pl on both hypatia and orestes; there are appropriate links # to this script from /usr/sbin/fence_nut. |
# The following commands implement the STONITH mechanism for our cluster: | ||||||||
Line: 439 to 528 | ||||||||
# The STONITH resource that can potentially shut down hypatia. | ||||||||
Changed: | ||||||||
< < | configure primitive HypatiaStonith stonith:external/nut params hostname="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="admin" password="acdc" | |||||||
> > | primitive StonithHypatia stonith:fence_nut params stonith-timeout="120s" pcmk_host_check="static-list" pcmk_host_list="hypatia.nevis.columbia.edu" ups="hypatia-ups" username="XXXX" password="XXXX" cycledelay="20" ondelay="20" offdelay="20" noverifyonoff="1" debug="1" | |||||||
# The node that runs the above script cannot be hypatia; it's # not wise to trust a node to STONITH itself. Note that the score # is "negative infinity," which means "never run this resource # on the named node." | ||||||||
Changed: | ||||||||
< < | configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu | |||||||
> > | location StonithHypatia_Location StonithHypatia -inf: hypatia.nevis.columbia.edu | |||||||
# The STONITH resource that can potentially shut down orestes. | ||||||||
Changed: | ||||||||
< < | configure primitive OrestesStonith stonith:external/nut params hostname="orestes.nevis.columbia.edu" ups="orestes-ups" username="admin" password="acdc" | |||||||
> > | primitive StonithOrestes stonith:fence_nut params stonith-timeout="120s" pcmk_host_check="static-list" pcmk_host_list="orestes.nevis.columbia.edu" ups="orestes-ups" username="XXXX" password="XXXX" cycledelay="20" ondelay="20" offdelay="20" noverifyonoff="1" debug="1" | |||||||
# Again, orestes cannot be the node that runs the above script. | ||||||||
Changed: | ||||||||
< < | configure location OresetesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu | |||||||
> > | location StonithOrestes_Location StonithOrestes -inf: orestes.nevis.columbia.edu | |||||||
cib commit stonith quit |
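Before trusting the STONITH setup, it's worth confirming that each node can actually reach the other node's UPS through NUT. A minimal sketch, assuming the NUT client tools are installed and the UPS names match the stonith resources above; replace the placeholder host with wherever upsd runs for that UPS:

upsc hypatia-ups@<nut-server>   # run on orestes; should print the UPS status variables
upsc orestes-ups@<nut-server>   # run on hypatia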
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 21 to 21 | ||||||||
/home/bin/recover-symlinks.sh /etc/rc.d/rc.fix-pacemaker-delay (on hypatia only) | ||||||||
Changed: | ||||||||
< < | The links are to an external site, pastebin; you can find these files by going to http://www.pastbin.com and searching for wgseligman 20130103 . |
> > | The links are to an external site, pastebin; you can find these files by going to http://www.pastebin.com and searching for wgseligman 20130103 . |
One-time set-up | ||||||||
Line: 72 to 72 | ||||||||
Clustered LVM setup | ||||||||
Changed: | ||||||||
< < | The following commands only have to be issued on one of the nodes. | |||||||
> > | Most of the following commands only have to be issued on one of the nodes. See Clusters From Scratch for the details. |
| ||||||||
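Related to the /etc/lvm/lvm.conf entry in the file list above: for a clustered volume group (the -c y flag used below), LVM has to be switched to cluster-aware locking on both nodes before vgcreate will cooperate. A hedged sketch, assuming the lvm2-cluster package and its clvmd daemon are what this cluster uses:

lvmconf --enable-cluster     # sets locking_type = 3 in /etc/lvm/lvm.conf
/sbin/service clvmd start    # clvmd must be running on both nodes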
Line: 94 to 94 | ||||||||
| ||||||||
Changed: | ||||||||
< < | Commands | |||||||
> > | Pacemaker configuration
Commands |
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes :
crm configure show | ||||||||
Added: | ||||||||
> > | To see the status of all the resources:
crm resource status | |||||||
To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
crm_mon | ||||||||
Line: 110 to 116 | ||||||||
sudo crm_mon | ||||||||
Changed: | ||||||||
< < | Concepts | |||||||
> > | Concepts | |||||||
This may help as you work your way through the configuration: | ||||||||
Changed: | ||||||||
< < | crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \ | |||||||
> > | crm configure primitive MyIPResource ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \ | |||||||
cidr_netmask=32 op monitor interval=30s # Which is composed of * crm ::= "cluster resource manager", the command we're executing * primitive ::= The type of resource object that we’re creating. | ||||||||
Changed: | ||||||||
< < | * IP ::= Our name for the resource | |||||||
> > | * MyIPResource ::= Our name for the resource | |||||||
* IPaddr2 ::= The script to call * ocf ::= The standard it conforms to * ip=192.168.85.3 ::= Parameter(s) as name/value pairs | ||||||||
Line: 144 to 150 | ||||||||
crm ra meta ocf:heartbeat:IPaddr2 | ||||||||
Changed: | ||||||||
< < | Configuration | |||||||
> > | Initial configuration guide | |||||||
Changed: | ||||||||
< < | This work was done in Sep-2010, with major revisions for stability in Aug-2011. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. | |||||||
> > | This work was done in Apr-2012. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them. | |||||||
# The commands ultimately used to configure the high-availability (HA) servers: | ||||||||
Changed: | ||||||||
< < | # The beginning: make sure corosync is running on both hypatia and orestes: /sbin/service corosync start # The following line is needed because we have only two machines in # the HA cluster. | |||||||
> > | # The beginning: make sure pacemaker is running on both hypatia and orestes: | |||||||
Changed: | ||||||||
< < | crm configure property no-quorum-policy=ignore | |||||||
> > | /sbin/service pacemaker status crm node status crm resource status | |||||||
# We'll configure STONITH later (see below) crm configure property stonith-enabled=false | ||||||||
Deleted: | ||||||||
< < | # Define IP addresses to be managed by the HA systems. crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=129.236.252.11 cidr_netmask=32 op monitor interval=30s crm configure primitive LocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.11 cidr_netmask=32 op monitor interval=30s crm configure primitive SandboxIP ocf:heartbeat:IPaddr2 params ip=10.43.7.11 cidr_netmask=32 op monitor interval=30s # Group these together, so they'll all be assigned to the same machine. # The name of the group is "MainIPGroup". crm configure group MainIPGroup ClusterIP LocalIP SandboxIP | |||||||
# Let's continue by entering the crm utility for short sessions. I'm going to | ||||||||
Changed: | ||||||||
< < | # test groups of commands before I commit them. (I omit the "configure show' # and "status" commands that I frequently typed in, in order to see that # everything was correct.) | |||||||
> > | # test groups of commands before I commit them. I omit the "crm configure show' # and "crm status" commands that I frequently typed in, in order to see that # everything was correct. # I also omit the standard resource options # (e.g., "... op monitor interval="20" timeout="40" depth="0"...) to make the # commands look simpler. This particular option means to check that the # resource is running every 20 seconds, and to declare that the monitor operation # will generate an error if 40 seconds elapse without a response. You can see the # complete list with "crm configure show". | |||||||
# DRBD is a service that synchronizes the hard drives between two machines. | ||||||||
Changed: | ||||||||
< < | # For our cluster, one machine will have access to the "master" copy # and make all the changes to that copy; the other machine will have the # "slave" copy and mindlessly duplicate all the changes. # I previously configured the DRBD resources 'admin' and 'work'. What the # following commands do is put the maintenance of these resources under # the control of Pacemaker. | |||||||
> > | # When one machine makes any change to the DRBD disk, the other machine # immediately duplicates that change on the block level. We have a dual-primary # configuration, which means both machines can mount the DRBD disk at once. # Start by entering the resource manager. | |||||||
crm | ||||||||
Added: | ||||||||
> > | ||||||||
# Define a "shadow" configuration, to test things without committing them # to the HA cluster: cib new drbd | ||||||||
Line: 197 to 192 | ||||||||
# to the HA cluster: cib new drbd | ||||||||
Changed: | ||||||||
< < | # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/ configure primitive AdminDrbd ocf:linbit:drbd params drbd_resource=admin op monitor interval=60s # DRBD functions with a "master/slave" setup as described above. The following command # defines the name of the master disk partition ("Admin"). The remaining parameters # clarify that there are two copies, but only one can be the master, and # at most one can be a slave. configure master Admin AdminDrbd meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false # The machine that gets the master copy (the one that will make changes to the drive) # should also be the one with the main IP address. | |||||||
> > | # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/admin.res | |||||||
Changed: | ||||||||
< < | configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master | |||||||
> > | primitive AdminDrbd ocf:linbit:drbd params drbd_resource="admin" meta target-role="Master" # The following resources defines how the DRBD resource (AdminDrbd) is to # duplicated ("cloned") among the nodes. The parameters clarify that there are # two copies, one on each node, and both can be the master. ms AdminClone AdminDrbd meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true" | |||||||
Changed: | ||||||||
< < | # We want to wait before assigning IPs to a node until we know that # Admin has been promoted to master on that node. configure order AdminBeforeMainIP inf: Admin:promote MainIPGroup # I like these commands, so commit them to the running configuration. cib commit drbd # Things look good, so let's add another disk resource. I defined another drbd resource # with some spare disk space, called "work". The idea is that I can play with alternate # virtual machines and save them on "work" before I copy them to the more robust "admin". configure primitive WorkDrbd ocf:linbit:drbd params drbd_resource=work op monitor interval=60s configure master Work WorkDrbd meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false # I prefer the work directory to be on the main admin box, but it doesn't have to be. "500:" is # weighting factor; compare it to "inf:" (for infinity) which is used in most of these commands. configure colocation WorkPrefersMain 500: Work:Master MainIPGroup # Given a choice, try to put the Admin:Master on hypatia configure location DefinePreferredMainNode Admin 100: hypatia.nevis.columbia.edu | |||||||
> > | configure show | |||||||
Added: | ||||||||
> > | # Looks good, so commit the change. | |||||||
cib commit drbd quit | ||||||||
Changed: | ||||||||
< < | # Now try a resource that depends on ordering: On the node that has the master # resource for "work," mount that disk image as /work. | |||||||
> > | # Now define resources that depend on ordering. | |||||||
crm | ||||||||
Changed: | ||||||||
< < | cib new workdisk | |||||||
> > | cib new disk # The DRBD is available to the system. The next step is to tell LVM # that the volume group ADMIN exists on the disk. | |||||||
Changed: | ||||||||
< < | # To find out that there was an "ocf:heartbeat:Filesystem" that I could use, | |||||||
> > | # To find out that there was a resource "ocf:heartbeat:LVM" that I could use, | |||||||
# I used the command: ra classes | ||||||||
Line: 255 to 227 | ||||||||
ra list ocf heartbeat | ||||||||
Changed: | ||||||||
< < | # To find out what Filesystem parameters I needed, I used: | |||||||
> > | # To find out what LVM parameters I needed, I used: | |||||||
Changed: | ||||||||
< < | ra meta ocf:heartbeat:Filesystem | |||||||
> > | ra meta ocf:heartbeat:LVM | |||||||
# All of the above led me to create the following resource configuration: | ||||||||
Changed: | ||||||||
< < | configure primitive WorkDirectory ocf:heartbeat:Filesystem params device="/dev/drbd2" directory="/work" fstype="ext4" | |||||||
> > | primitive AdminLvm ocf:heartbeat:LVM params volgrpname="ADMIN" | |||||||
Changed: | ||||||||
< < | # Note that I had previously created an ext4 filesystem on /dev/drbd2. | |||||||
> > | # After I set up the volume group, I want to mount the logical volumes # (partitions) within the volume group. Here's one of the partitions, /usr/nevis; # note that I begin all the filesystem resources with FS so they'll be next # to each other when I type "crm configure show". | |||||||
Changed: | ||||||||
< < | # Now specify that we want this to be on the same node as Work:Master: | |||||||
> > | primitive FSUsrNevis ocf:heartbeat:Filesystem params device="/dev/mapper/ADMIN-usr" directory="/usr/nevis" fstype="gfs2" options="defaults,noatime,nodiratime" | |||||||
Changed: | ||||||||
< < | configure colocation DirectoryWithWork inf: WorkDirectory Work:Master | |||||||
> > | # I have similar definitions for the other logical volumes in volume group ADMIN: # /mail, /var/nevis, etc. | |||||||
Changed: | ||||||||
< < | # One more thing: It's important that we not try to mount the directory # until after Work has been promoted to master on the node. | |||||||
> > | # Now I'm going to define a resource group. The following command means: # - Put all these resources on the same node; # - Start these resources in the order they're listed; # - The resources depend on each other in the order they're listed. For example, # if AdminLvm fails, FSUsrNevis will not start, or will be stopped if it's running. | |||||||
Changed: | ||||||||
< < | # A score of "inf" means "infinity"; if the DRBD resource 'work' can't # be set up, then don't mount the /work partition. | |||||||
> > | group FilesystemGroup AdminLvm FSUsrNevis FSVarNevis FSVirtualMachines FSMail FSWork | |||||||
Changed: | ||||||||
< < | configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start | |||||||
> > | # We want these logical volumes (or partitions or filesystems) to be available # on both nodes. To do this, we define a clone resource. | |||||||
Changed: | ||||||||
< < | cib commit workdisk quit # We've made the relatively-unimportant DRBD resource 'work' function. Let's do it for 'admin'. # Previously I created some LVM volumes on the admin DRBD master. We need to use a # resource to active them, but we can't activate them until after the Admin:Master # is loaded. crm cib new lvm | |||||||
> > | clone FilesystemClone FilesystemGroup meta interleave="true" | |||||||
Changed: | ||||||||
< < | # Activate the LVM volumes, but only after DRBD has figured out where # Admin:Master is located. | |||||||
> > | # One more thing: It's important that we not try to set up the filesystems # until the DRBD admin resource is running on a node, and has been # promoted to master. | |||||||
Changed: | ||||||||
< < | configure primitive Lvm ocf:heartbeat:LVM params volgrpname="admin" configure colocation LvmWithAdmin inf: Lvm Admin:Master configure order AdminBeforeLvm inf: Admin:promote Lvm:start | |||||||
> > | # A score of "inf" means "infinity"; if the DRBD resource 'AdminClone' can't # be promoted, then don't start the 'FilesystemClone' resource. | |||||||
Changed: | ||||||||
< < | cib commit lvm | |||||||
> > | colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start | |||||||
Changed: | ||||||||
< < | # Go back to the actual, live configuration cib use live # See if everything is working configure show status # Go back to the shadow for more commands. cib use lvm # We have a whole bunch of filesystems on the "admin" volume group. Let's # create the commands to mount them. # The 'timeout="240s' piece is to give a four-minute interval to start # up the mount. This allows for a "it's been too long, do an fsck" check # on mounting the filesystem. # We also allow five minutes for the unmounting to stop, just in case # it's taking a while for some job on server to let go of the mount. # It's better that it take a while to switch over the system service # than for the mount to be forcibly terminated. configure primitive UsrDirectory ocf:heartbeat:Filesystem params device="/dev/admin/usr" directory="/usr/nevis" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s" configure primitive VarDirectory ocf:heartbeat:Filesystem params device="/dev/admin/var" directory="/var/nevis" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s" configure primitive MailDirectory ocf:heartbeat:Filesystem params device="/dev/admin/mail" directory="/mail" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s" configure primitive XenDirectory ocf:heartbeat:Filesystem params device="/dev/admin/xen" directory="/xen" fstype="ext4" op start interval="0" timeout="240s" op stop interval="0" timeout="300s" configure group AdminDirectoriesGroup UsrDirectory VarDirectory MailDirectory XenDirectory # We can't mount any of them until LVM is set up: configure colocation DirectoriesWithLVM inf: AdminDirectoriesGroup Lvm configure order LvmBeforeDirectories inf: Lvm AdminDirectoriesGroup cib commit lvm | |||||||
> > | cib commit disk | |||||||
quit # Some standard Linux services are under corosync's control. They depend on some or | ||||||||
Line: 511 to 432 | ||||||||
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links # to this script from the stonith/external directory. | ||||||||
Deleted: | ||||||||
< < | # By the way, I sent the script to Cluster Labs, who accepted it. # The next generation of their distribution will include the script. | |||||||
# The following commands implement the STONITH mechanism for our cluster: crm | ||||||||
Line: 549 to 467 | ||||||||
crm configure property stonith-enabled=true | ||||||||
Deleted: | ||||||||
< < | # At this point, the key elements of the high-availability configuration have # been set up. There is one non-critical frill: One node (probably hypatia) will be # running the important services, while the other node (probably orestes) would # be "twiddling its thumbs." Instead, let's have orestes do something useful: execute # condor jobs. # For orestes to do this, it requires the condor service. It also requires that # library:/usr/nevis is mounted, the same as every other batch machine on the # Nevis condor cluster. We can't use the automount daemon (amd) to do this for # us, the way we do on the other batch nodes; we have to make corosync do the # mounts. crm cib new condor # Mount library:/usr/nevis. A bit of a name confusion here: there's a /work # partition on the primary node, but the name 'LibraryOnWork" means that # the nfs-mount of /usr/nevis is located on the secondary or "work" node. configure primitive LibraryOnWork ocf:heartbeat:Filesystem params device="library:/usr/nevis" directory="/usr/nevis" fstype="nfs" # Corosync must not NFS-mount library:/usr/nevis on the system has already # mounted /usr/nevis directly as part of AdminDirectoriesGroup # described above. # Note that if there's only one node remaining in the high-availability # cluster, it will be running the resource AdminDirectoriesGroup, and # LibraryOnWork will never be started. This is fine; if there's only one # node left, I don't want it running condor jobs. configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup # Determine on which machine we mount library:/usr/nevis after the NFS # export of /usr/nevis has been set up. configure order NfsBeforeLibrary inf: Nfs LibraryOnWork # Define the IPs associated with the backup system, and group them together. # This is a non-critical definition, and I don't want to assign it until the more important # "secondary" resources have been set up. configure primitive Burr ocf:heartbeat:IPaddr2 params ip=129.236.252.10 cidr_netmask=32 op monitor interval=30s configure primitive BurrLocal ocf:heartbeat:IPaddr2 params ip=10.44.7.10 cidr_netmask=32 op monitor interval=30s configure group AssistantIPGroup Burr BurrLocal colocation AssistantWithLibrary inf: AssistantIPGroup LibraryOnWork order LibraryBeforeAssistant inf: LibraryOnWork AssistantIPGroup # The standard condor execution service. As with all the batch nodes, # I've already configured /etc/condor/condor_config.local and created # scratch directories in /data/condor. configure primitive Condor lsb:condor # If we're able mount library:/usr/nevis, then it's safe to start condor. # If we can't mount library:/usr/nevis, then condor will never be started. # (We stated above that AssistantIPGroup won't start until after LibraryOnWork). configure colocation CondorWithAssistant inf: Condor AssistantIPGroup configure order AssistantBeforeCondor inf: AssistantIPGroup Condor cib commit condor quit | |||||||
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Nevis particle-physics administrative cluster configuration | ||||||||
Line: 10 to 10 | ||||||||
Files | ||||||||
Changed: | ||||||||
< < | Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files:
/etc/cluster/cluster.conf /etc/lvm/lvm.conf /etc/drbd.d/global_common.conf /etc/drbd.d/admin.res /home/bin/fence_nut.pl /etc/rc.d/rc.local /home/bin/recover-symlinks.sh /etc/rc.d/rc.fix-pacemaker-delay (on hypatia only) | |||||||
> > | Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files!
/etc/cluster/cluster.conf /etc/lvm/lvm.conf /etc/drbd.d/global_common.conf /etc/drbd.d/admin.res /home/bin/fence_nut.pl /etc/rc.d/rc.local /home/bin/recover-symlinks.sh /etc/rc.d/rc.fix-pacemaker-delay (on hypatia only)
The links are to an external site, pastebin; you can find these files by going to http://www.pastbin.com and searching for wgseligman 20130103 . |
One-time set-up | ||||||||
Line: 30 to 31 | ||||||||
DRBD set-up | ||||||||
Changed: | ||||||||
< < | Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia : | |||||||
> > | Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res . Then on hypatia : | |||||||
/sbin/drbdadm create-md admin /sbin/service drbd start | ||||||||
Line: 69 to 70 | ||||||||
Here's an explanation of the contents of /proc/drbd . |
Added: | ||||||||
> > | Clustered LVM setup
The following commands only have to be issued on one of the nodes.
| |||||||
Commands
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes :
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Changed: | ||||||||
< < | Nevis particle-physics administrative cluster configuration | |||||||
> > | Nevis particle-physics administrative cluster configuration | |||||||
This is a reference page. It contains a text file that describes how the high-availability pacemaker cluster is configured on hypatia and orestes . |
Line: 13 to 17 | ||||||||
/etc/drbd.d/global_common.conf /etc/drbd.d/admin.res /home/bin/fence_nut.pl | ||||||||
Added: | ||||||||
> > | /etc/rc.d/rc.local
/home/bin/recover-symlinks.sh
One-time set-up
The commands to set up a dual-primary cluster are outlined here. The details can be found in Clusters From Scratch.
DRBD set-up
Edit /etc/drbd.d/global_common.conf and create /etc/drbd.d/admin.res. Then on hypatia :
/sbin/drbdadm create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin
Then on orestes :
/sbin/drbdadm --force create-md admin
/sbin/service drbd start
/sbin/drbdadm up admin |
Added: | ||||||||
> > | Back to hypatia :
/sbin/drbdadm -- --overwrite-data-of-peer primary admin
cat /proc/drbd
Keep looking at the contents of /proc/drbd . It will take a while, but eventually the two disks will sync up.
Back to orestes :
/sbin/drbdadm primary admin
cat /proc/drbd
The result should be something like this:
# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@hypatia-tb.nevis.columbia.edu, 2012-02-14 17:04:51
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:162560777 nr:78408289 dw:240969067 dr:747326438 al:10050 bm:1583 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Here's an explanation of the contents of /proc/drbd . |
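While waiting for the initial sync, something like this saves re-typing the cat command (watch is standard; the interval is arbitrary):

watch -n 10 cat /proc/drbd
# The sync is finished when the ds: field reads UpToDate/UpToDate on both nodes.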
Commands
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes :
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
Nevis particle-physics administrative cluster configuration
This is a reference page. It contains a text file that describes how the high-availability pacemaker cluster is configured on hypatia and orestes .
Files
Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files:
/etc/cluster/cluster.conf
/etc/lvm/lvm.conf
/etc/drbd.d/global_common.conf
/etc/drbd.d/admin.res
/home/bin/fence_nut.pl
Commands
The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes :
crm configure showTo get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit): crm_monYou can run the above commands via sudo, but you'll have to extend your path; e.g., export PATH=/sbin:/usr/sbin:${PATH} sudo crm_mon ConceptsThis may help as you work your way through the configuration:crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \ cidr_netmask=32 op monitor interval=30s # Which is composed of * crm ::= "cluster resource manager", the command we're executing * primitive ::= The type of resource object that we’re creating. * IP ::= Our name for the resource * IPaddr2 ::= The script to call * ocf ::= The standard it conforms to * ip=192.168.85.3 ::= Parameter(s) as name/value pairs * cidr_netmask ::= netmask; 32-bits means use this exact IP address * op ::== what follows are options * monitor interval=30s ::= check every 30 seconds that this resource is working # ... timeout = how to long wait before you assume a resource is dead.How to find out which scripts exist, that is, which resources can be controlled by the HA cluster: crm ra classesBased on the result, I looked at: crm ra list ocf heartbeatTo find out what IPaddr2 parameters I needed, I used: crm ra meta ocf:heartbeat:IPaddr2 ConfigurationThis work was done in Sep-2010, with major revisions for stability in Aug-2011. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.# The commands ultimately used to configure the high-availability (HA) servers: # The beginning: make sure corosync is running on both hypatia and orestes: /sbin/service corosync start # The following line is needed because we have only two machines in # the HA cluster. crm configure property no-quorum-policy=ignore # We'll configure STONITH later (see below) crm configure property stonith-enabled=false # Define IP addresses to be managed by the HA systems. crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=129.236.252.11 \ cidr_netmask=32 op monitor interval=30s crm configure primitive LocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.11 \ cidr_netmask=32 op monitor interval=30s crm configure primitive SandboxIP ocf:heartbeat:IPaddr2 params ip=10.43.7.11 \ cidr_netmask=32 op monitor interval=30s # Group these together, so they'll all be assigned to the same machine. # The name of the group is "MainIPGroup". crm configure group MainIPGroup ClusterIP LocalIP SandboxIP # Let's continue by entering the crm utility for short sessions. I'm going to # test groups of commands before I commit them. (I omit the "configure show' # and "status" commands that I frequently typed in, in order to see that # everything was correct.) # DRBD is a service that synchronizes the hard drives between two machines. # For our cluster, one machine will have access to the "master" copy # and make all the changes to that copy; the other machine will have the # "slave" copy and mindlessly duplicate all the changes. # I previously configured the DRBD resources 'admin' and 'work'. What the # following commands do is put the maintenance of these resources under # the control of Pacemaker. 
crm # Define a "shadow" configuration, to test things without committing them # to the HA cluster: cib new drbd # The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/ configure primitive AdminDrbd ocf:linbit:drbd params drbd_resource=admin op monitor interval=60s # DRBD functions with a "master/slave" setup as described above. The following command # defines the name of the master disk partition ("Admin"). The remaining parameters # clarify that there are two copies, but only one can be the master, and # at most one can be a slave. configure master Admin AdminDrbd meta master-max=1 master-node-max=1 \ clone-max=2 clone-node-max=1 notify=true globally-unique=false # The machine that gets the master copy (the one that will make changes to the drive) # should also be the one with the main IP address. configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master # We want to wait before assigning IPs to a node until we know that # Admin has been promoted to master on that node. configure order AdminBeforeMainIP inf: Admin:promote MainIPGroup # I like these commands, so commit them to the running configuration. cib commit drbd # Things look good, so let's add another disk resource. I defined another drbd resource # with some spare disk space, called "work". The idea is that I can play with alternate # virtual machines and save them on "work" before I copy them to the more robust "admin". configure primitive WorkDrbd ocf:linbit:drbd params drbd_resource=work op monitor interval=60s configure master Work WorkDrbd meta master-max=1 master-node-max=1 \ clone-max=2 clone-node-max=1 notify=true globally-unique=false # I prefer the work directory to be on the main admin box, but it doesn't have to be. "500:" is # weighting factor; compare it to "inf:" (for infinity) which is used in most of these commands. configure colocation WorkPrefersMain 500: Work:Master MainIPGroup # Given a choice, try to put the Admin:Master on hypatia configure location DefinePreferredMainNode Admin 100: hypatia.nevis.columbia.edu cib commit drbd quit # Now try a resource that depends on ordering: On the node that has the master # resource for "work," mount that disk image as /work. crm cib new workdisk # To find out that there was an "ocf:heartbeat:Filesystem" that I could use, # I used the command: ra classes # Based on the result, I looked at: ra list ocf heartbeat # To find out what Filesystem parameters I needed, I used: ra meta ocf:heartbeat:Filesystem # All of the above led me to create the following resource configuration: configure primitive WorkDirectory ocf:heartbeat:Filesystem \ params device="/dev/drbd2" directory="/work" fstype="ext4" # Note that I had previously created an ext4 filesystem on /dev/drbd2. # Now specify that we want this to be on the same node as Work:Master: configure colocation DirectoryWithWork inf: WorkDirectory Work:Master # One more thing: It's important that we not try to mount the directory # until after Work has been promoted to master on the node. # A score of "inf" means "infinity"; if the DRBD resource 'work' can't # be set up, then don't mount the /work partition. configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start cib commit workdisk quit # We've made the relatively-unimportant DRBD resource 'work' function. Let's do it for 'admin'. # Previously I created some LVM volumes on the admin DRBD master. We need to use a # resource to active them, but we can't activate them until after the Admin:Master # is loaded. 
crm cib new lvm # Activate the LVM volumes, but only after DRBD has figured out where # Admin:Master is located. configure primitive Lvm ocf:heartbeat:LVM \ params volgrpname="admin" configure colocation LvmWithAdmin inf: Lvm Admin:Master configure order AdminBeforeLvm inf: Admin:promote Lvm:start cib commit lvm # Go back to the actual, live configuration cib use live # See if everything is working configure show status # Go back to the shadow for more commands. cib use lvm # We have a whole bunch of filesystems on the "admin" volume group. Let's # create the commands to mount them. # The 'timeout="240s' piece is to give a four-minute interval to start # up the mount. This allows for a "it's been too long, do an fsck" check # on mounting the filesystem. # We also allow five minutes for the unmounting to stop, just in case # it's taking a while for some job on server to let go of the mount. # It's better that it take a while to switch over the system service # than for the mount to be forcibly terminated. configure primitive UsrDirectory ocf:heartbeat:Filesystem \ params device="/dev/admin/usr" directory="/usr/nevis" fstype="ext4" \ op start interval="0" timeout="240s" \ op stop interval="0" timeout="300s" configure primitive VarDirectory ocf:heartbeat:Filesystem \ params device="/dev/admin/var" directory="/var/nevis" fstype="ext4" \ op start interval="0" timeout="240s" \ op stop interval="0" timeout="300s" configure primitive MailDirectory ocf:heartbeat:Filesystem \ params device="/dev/admin/mail" directory="/mail" fstype="ext4" \ op start interval="0" timeout="240s" \ op stop interval="0" timeout="300s" configure primitive XenDirectory ocf:heartbeat:Filesystem \ params device="/dev/admin/xen" directory="/xen" fstype="ext4" \ op start interval="0" timeout="240s" \ op stop interval="0" timeout="300s" configure group AdminDirectoriesGroup UsrDirectory VarDirectory MailDirectory XenDirectory # We can't mount any of them until LVM is set up: configure colocation DirectoriesWithLVM inf: AdminDirectoriesGroup Lvm configure order LvmBeforeDirectories inf: Lvm AdminDirectoriesGroup cib commit lvm quit # Some standard Linux services are under corosync's control. They depend on some or # all of the filesystems being mounted. # Let's start with a simple one: enable the printing service (cups): crm cib new printing # lsb = "Linux Standard Base." It just means any service which is # controlled by the one of the standard scripts in /etc/init.d configure primitive Cups lsb:cups # Cups stores its spool files in /var/spool/cups. If the cups service # were to switch to a different server, we want the new server to see # the spooled files. So create /var/nevis/cups, link it with: # mv /var/spool/cups /var/spool/cups.ori # ln -sf /var/nevis/cups /var/spool/cups # and demand that the cups service only start if /var/nevis (and the other # high-availability directories) have been mounted. configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup # In order to prevent chaos, make sure that the high-availability directories # have been mounted before we try to start cups. configure order VarBeforeCups inf: AdminDirectoriesGroup Cups cib commit printing quit # The other services (xinetd, dhcpd) follow the same pattern as above: # Make sure the services start on the same machine as the admin directories, # and after the admin directories are successfully mounted. 
crm
cib new services

configure primitive Xinetd lsb:xinetd
configure primitive Dhcpd lsb:dhcpd
configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup
configure order VarBeforeXinetd inf: VarDirectory Xinetd
configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup
configure order VarBeforeDhcpd inf: VarDirectory Dhcpd

cib commit services
quit

# The high-availability servers export some of the admin directories to other
# systems, both real and virtual; for example, the /usr/nevis directory is
# exported to all the other machines on the Nevis Linux cluster.
# NFS exporting of a shared directory can be a little tricky. As with CUPS
# spooling, we want to preserve the NFS export state in a way that the
# backup server can pick it up. The safest way to do this is to create a
# small separate LVM partition ("nfs") and mount it as "/var/lib/nfs",
# the NFS directory that contains files that keep track of the NFS state.
crm
cib new nfs

# Define the mount for the NFS state directory /var/lib/nfs
configure primitive NfsStateDirectory ocf:heartbeat:Filesystem \
   params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4"
configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory

# Now that the NFS state directory is mounted, we can start nfslockd. Note
# that we're starting NFS lock on both the primary and secondary HA systems;
# by default a "clone" resource is started on all systems in a cluster.
# (Placing nfslockd under the control of Pacemaker turned out to be key to
# successful transfer of cluster services to another node. The nfslockd and
# nfs daemon information stored in /var/lib/nfs have to be consistent.)
configure primitive NfsLockInstance lsb:nfslock
configure clone NfsLock NfsLockInstance
configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock

# Once nfslockd has been set up, we can start NFS. (We say to colocate
# NFS with 'NfsStateDirectory', instead of nfslockd, because nfslockd
# is going to be started on both nodes.)
configure primitive Nfs lsb:nfs
configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory
configure order NfsLockBeforeNfs inf: NfsLock Nfs

cib commit nfs
quit

# The whole point of the entire setup is to be able to run guest virtual machines
# under the control of the high-availability service. Here is the set-up for one example
# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.
# I duplicated the same procedure for franklin (mail server), ada (web server), and
# so on, but I don't show that here.
crm
cib new hogwarts

# Give the virtual machine a long stop interval before flagging an error.
# Sometimes it takes a while for Linux to shut down.
configure primitive Hogwarts ocf:heartbeat:Xen params \
   xmfile="/xen/configs/Hogwarts.cfg" \
   op stop interval="0" timeout="240"

# All the virtual machine files are stored in the /xen partition, which is one
# of the high-availability admin directories. The virtual machine must run on
# the system with this directory.
configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup

# All of the virtual machines depend on NFS-mounting directories which
# are exported by the HA server. The safest thing to do is to make sure
# NFS is running on the HA server before starting the virtual machine.
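# (Again not part of the original commands: a few standard checks after committing
#  the "nfs" and "hogwarts" shadows. "Hogwarts" is the example guest defined above.)
#
# The NFS exports should be active on the node holding the admin directories:
#    exportfs -v
# Clients should be able to see those exports:
#    showmount -e hypatia.nevis.columbia.edu
# The Xen guest should be running on that same node:
#    xm list
#    crm resource status Hogwarts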
configure order NfsBeforeHogwarts inf: Nfs Hogwarts

cib commit hogwarts
quit

# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed.
# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.
# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This is a special case of "fencing": once a node or resource fails, it can't be allowed
# to start up again automatically.
# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut off the power on the UPS connected to the failed node.
# (By the way, this is why you have to be careful about restarting hypatia or orestes.
# The STONITH mechanism may cause the UPS on the restarting
# computer to turn off the power; it will never come back up.)
# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.
# The official corosync distribution from <http://www.clusterlabs.org/>
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links
# to this script from the stonith/external directory.
# By the way, I sent the script to Cluster Labs, who accepted it.
# The next generation of their distribution will include the script.

# The following commands implement the STONITH mechanism for our cluster:
crm
cib new stonith

# The STONITH resource that can potentially shut down hypatia.
configure primitive HypatiaStonith stonith:external/nut \
   params hostname="hypatia.nevis.columbia.edu" \
   ups="hypatia-ups" username="admin" password="acdc"

# The node that runs the above script cannot be hypatia; it's
# not wise to trust a node to STONITH itself. Note that the score
# is "negative infinity," which means "never run this resource
# on the named node."
configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu

# The STONITH resource that can potentially shut down orestes.
configure primitive OrestesStonith stonith:external/nut \
   params hostname="orestes.nevis.columbia.edu" \
   ups="orestes-ups" username="admin" password="acdc"

# Again, orestes cannot be the node that runs the above script.
configure location OrestesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu

cib commit stonith
quit

# Now turn the STONITH mechanism on for the cluster.
crm configure property stonith-enabled=true

# At this point, the key elements of the high-availability configuration have
# been set up. There is one non-critical frill: One node (probably hypatia) will be
# running the important services, while the other node (probably orestes) would
# be "twiddling its thumbs." Instead, let's have orestes do something useful: execute
# condor jobs.
# For orestes to do this, it requires the condor service. It also requires that
# library:/usr/nevis is mounted, the same as every other batch machine on the
# Nevis condor cluster.
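# (Not part of the original configuration: some cautious ways to exercise the STONITH
#  setup. Remember that actually firing STONITH cuts UPS power to a node, and by design
#  that node will not come back up on its own.)
#
# Confirm that the fencing resources are registered with pacemaker:
#    stonith_admin --list-registered
# Confirm that NUT can talk to each UPS (the UPS names are the ones used above):
#    upsc hypatia-ups
#    upsc orestes-ups
# A full end-to-end test means deliberately fencing a node, for example with
#    stonith_admin --fence orestes.nevis.columbia.edu
# which really will turn off that UPS, so only do it during scheduled downtime.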
# We can't use the automount daemon (amd) to do this for us, the way we do on
# the other batch nodes; we have to make corosync do the mounts.
crm
cib new condor

# Mount library:/usr/nevis. A bit of a name confusion here: there's a /work
# partition on the primary node, but the name 'LibraryOnWork' means that
# the nfs-mount of /usr/nevis is located on the secondary or "work" node.
configure primitive LibraryOnWork ocf:heartbeat:Filesystem \
   params device="library:/usr/nevis" directory="/usr/nevis" \
   fstype="nfs"

# Corosync must not NFS-mount library:/usr/nevis on the system that has already
# mounted /usr/nevis directly as part of AdminDirectoriesGroup
# described above.
# Note that if there's only one node remaining in the high-availability
# cluster, it will be running the resource AdminDirectoriesGroup, and
# LibraryOnWork will never be started. This is fine; if there's only one
# node left, I _don't_ want it running condor jobs.
configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup

# Only mount library:/usr/nevis _after_ the NFS
# export of /usr/nevis has been set up.
configure order NfsBeforeLibrary inf: Nfs LibraryOnWork

# Define the IPs associated with the backup system, and group them together.
# This is a non-critical definition, and I don't want to assign it until the more important
# "secondary" resources have been set up.
configure primitive Burr ocf:heartbeat:IPaddr2 params ip=129.236.252.10 \
   cidr_netmask=32 op monitor interval=30s
configure primitive BurrLocal ocf:heartbeat:IPaddr2 params ip=10.44.7.10 cidr_netmask=32 op monitor interval=30s
configure group AssistantIPGroup Burr BurrLocal
configure colocation AssistantWithLibrary inf: AssistantIPGroup LibraryOnWork
configure order LibraryBeforeAssistant inf: LibraryOnWork AssistantIPGroup

# The standard condor execution service. As with all the batch nodes,
# I've already configured /etc/condor/condor_config.local and created
# scratch directories in /data/condor.
configure primitive Condor lsb:condor

# If we're able to mount library:/usr/nevis, then it's safe to start condor.
# If we can't mount library:/usr/nevis, then condor will never be started.
# (We stated above that AssistantIPGroup won't start until after LibraryOnWork).
configure colocation CondorWithAssistant inf: Condor AssistantIPGroup
configure order AssistantBeforeCondor inf: AssistantIPGroup Condor

cib commit condor
quit
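# (One last aside, not from the original notes: after everything above is committed,
#  the standard way to verify the whole arrangement is a controlled failover test,
#  using ordinary crmsh and condor commands. Only do this when downtime is acceptable.)
#
# Check that the secondary node has joined the condor pool:
#    condor_status | grep orestes
# Put the primary node in standby and watch the services and guests migrate:
#    crm node standby hypatia.nevis.columbia.edu
#    crm_mon -1
# Then bring it back:
#    crm node online hypatia.nevis.columbia.edu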
|