Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Added:
> > Archived 20-Sep-2013: The high-availability cluster has been set aside in favor of a more traditional single-box admin server. HA is grand in theory, but in the three years we operated the cluster we had no hardware problems that the HA set-up would have prevented, yet many hours of downtime due to problems with the HA software. See this mailing-list post.

This is a reference page. It contains a text file that describes how the high-availability Pacemaker/Corosync cluster is configured on hypatia and orestes.

Files
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Line: 388 to 388

# Give the virtual machine a long stop interval before flagging an error.
# Sometimes it takes a while for Linux to shut down.

Changed:
< < configure primitive Hogwarts ocf:heartbeat:Filesystem params
> > configure primitive Hogwarts ocf:heartbeat:Xen params

      xmfile="/xen/configs/Hogwarts.cfg" op stop interval="0" timeout="240"
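The fix above swaps in the Xen resource agent, whose xmfile parameter points at the domain's configuration file. If you ever need to double-check what other parameters that agent accepts, the same discovery command used elsewhere on this page should work (this example is an addition, not part of the original notes):

   crm ra meta ocf:heartbeat:Xen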
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Line: 114 to 114

# and make all the changes to that copy; the other machine will have the
# "slave" copy and mindlessly duplicate all the changes.

Added:
> > # I previously configured the DRBD resources 'admin' and 'work'. What the
    # following commands do is put the maintenance of these resources under
    # the control of Pacemaker.

crm
# Define a "shadow" configuration, to test things without committing them
# to the HA cluster:
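The DRBD resources 'admin' and 'work' mentioned in the note above are defined outside of Pacemaker, in the /etc/drbd.d/*.res files listed under "Files". For orientation only, a minimal resource definition looks roughly like the sketch below; the device number, backing partitions, and replication addresses are illustrative assumptions, not values taken from this page:

   resource admin {
     protocol C;
     on hypatia.nevis.columbia.edu {
       device    /dev/drbd1;            # assumed device number
       disk      /dev/sda3;             # assumed backing partition
       address   192.168.100.1:7789;    # assumed replication link
       meta-disk internal;
     }
     on orestes.nevis.columbia.edu {
       device    /dev/drbd1;
       disk      /dev/sda3;
       address   192.168.100.2:7789;
       meta-disk internal;
     }
   }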
Line: 376 to 380

# configuration to /xen/configs/hogwarts.cfg.
# I duplicated the same procedure for franklin (mail server), ada (web server), and
Changed:
< < # so on, but I don't know that here.
> > # so on, but I don't show that here.

crm
cib new hogwarts
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Line: 72 to 72

Configuration

Changed:
< < This work was done in Sep-2010. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.
> > This work was done in Sep-2010, with major revisions for stability in Aug-2011. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.

# The commands ultimately used to configure the high-availability (HA) servers:
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Line: 136 to 136

configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master

Added:
> > # We want to wait before assigning IPs to a node until we know that
    # Admin has been promoted to master on that node.
    configure order AdminBeforeMainIP inf: Admin:promote MainIPGroup

# I like these commands, so commit them to the running configuration.
cib commit drbd
Line: 191 to 195

# One more thing: It's important that we not try to mount the directory
# until after Work has been promoted to master on the node.

Added:
> > # A score of "inf" means "infinity"; if the DRBD resource 'work' can't
    # be set up, then don't mount the /work partition.

configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start
cib commit workdisk
quit
Changed:
< < # We've made the relatively-unimportant work DRBD master function. Let's do it for real.
    # Prevously I created some LVM volumes on the admin DRBD master. We need to use a
> > # We've made the relatively-unimportant DRBD resource 'work' function. Let's do it for 'admin'.
    # Previously I created some LVM volumes on the admin DRBD master. We need to use a

# resource to activate them, but we can't activate them until after the Admin:Master
# is loaded.
crm
Line: 289 to 296

# and demand that the cups service only start if /var/nevis (and the other
# high-availability directories) have been mounted.

Deleted:
< < # A score of "inf" means "infinity"; if it can't be run on the
    # machine that mounted all the admin directories, it won't run at all.

configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup

# In order to prevent chaos, make sure that the high-availability directories

Line: 341 to 345

configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory
Changed:
< < # Now that the NFS state directory is mounted, we can start the nfslockd. Note that
> > # Now that the NFS state directory is mounted, we can start nfslockd. Note that

# that we're starting NFS lock on both the primary and secondary HA systems;
# by default a "clone" resource is started on all systems in a cluster.
Added:
> > # (Placing nfslockd under the control of Pacemaker turned out to be key to
    # successful transfer of cluster services to another node. The nfslockd and
    # nfs daemon information stored in /var/lib/nfs have to be consistent.)

configure primitive NfsLockInstance lsb:nfslock

Changed:
< < clone NfsLock NfsLockInstance
> > configure clone NfsLock NfsLockInstance
Changed:
< < # Once nfslockd has been set up, we can start NFS.
> > configure order NfsStateBeforeNfsLock inf: NfsStateDirectory NfsLock

    # Once nfslockd has been set up, we can start NFS. (We say to colocate
    # NFS with 'NfsStateDirectory', instead of nfslockd, because nfslockd
    # is going to be started on both nodes.)

configure primitive Nfs lsb:nfs
configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory

Changed:
< < configure order NfsStateBeforeNfs inf: NfsStateDirectory Nfs
> > configure order NfsLockBeforeNfs inf: NfsLock Nfs

cib commit nfs
quit

Line: 363 to 375

# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.

Added:
> > # I duplicated the same procedure for franklin (mail server), ada (web server), and
    # so on, but I don't know that here.

crm
cib new hogwarts

Line: 405 to 420

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut-off the power on the UPS connected to the failed node.
Changed:
< < # (By the way, this is why you have to restart hypatia and orestes at the same time.
    # If you just restart one, the STONITH mechanism will cause the UPS on the restarting
> > # (By the way, this is why you have to be careful about restarting hypatia or orestes.
    # The STONITH mechanism may cause the UPS on the restarting

# computer to turn off the power; it will never come back up.)
# At Nevis, the UPSes are monitored and controlled using the NUT package

Line: 466 to 481

# For orestes to do this, it requires the condor service. It also requires that
# library:/usr/nevis is mounted, the same as every other batch machine on the
# Nevis condor cluster. We can't use the automount daemon (amd) to do this for

Changed:
< < # us, the way we do on the other batch nodes, so we have to make corosync do the
> > # us, the way we do on the other batch nodes; we have to make corosync do the

# mounts.
crm
cib new condor
Changed:
< < # Mount library:/usr/nevis
> > # Mount library:/usr/nevis. A bit of a name confusion here: there's a /work
    # partition on the primary node, but the name 'LibraryOnWork' means that
    # the nfs-mount of /usr/nevis is located on the secondary or "work" node.

configure primitive LibraryOnWork ocf:heartbeat:Filesystem params device="library:/usr/nevis" directory="/usr/nevis"

Changed:
< < fstype="nfs" OCF_CHECK_LEVEL="20"
> > fstype="nfs"

Changed:
< < # Corosync must NOT mount library:/usr/nevis on the system has already
> > # Corosync must not NFS-mount library:/usr/nevis on the system that has already

# mounted /usr/nevis directly as part of AdminDirectoriesGroup
# described above.
Line: 489 to 506

configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup

Changed:
< < # Determine on which machine we mount library:/usr/nevis after we
    # figure out which machine is running AdminDirectoriesGroup.
> > # Determine on which machine we mount library:/usr/nevis after the NFS
    # export of /usr/nevis has been set up.

Changed:
< < configure order DirectoresBeforeLibrary inf: AdminDirectoriesGroup LibraryOnWork
> > configure order NfsBeforeLibrary inf: Nfs LibraryOnWork

# Define the IPs associated with the backup system, and group them together.
# This is a non-critical definition, and I don't want to assign it until the more important
Line: 515 to 532

# If we're able to mount library:/usr/nevis, then it's safe to start condor.
# If we can't mount library:/usr/nevis, then condor will never be started.

Added:
> > # (We stated above that AssistantIPGroup won't start until after LibraryOnWork.)
Changed:
< < configure colocation CondorWithLibrary inf: Condor LibraryOnWork
    # library:/usr/nevis must be mounted before condor starts.
    configure order LibraryBeforeCondor inf: LibraryOnWork Condor
> > configure colocation CondorWithAssistant inf: Condor AssistantIPGroup
    configure order AssistantBeforeCondor inf: AssistantIPGroup Condor

cib commit condor
quit
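After a change like the one above it's worth confirming where the affected resources actually landed. The commands below aren't part of the original notes, but they are standard Pacemaker tooling and should work on this setup; each one prints the node currently running the named resource:

   crm_resource --resource Condor --locate
   crm_resource --resource AssistantIPGroup --locate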
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Line: 108 to 108

# test groups of commands before I commit them. (I omit the "configure show"
# and "status" commands that I frequently typed in, in order to see that
# everything was correct.)

Deleted:
< < crm
    # Define a "shadow" configuration, to test things without commiting them
    # to the HA cluster:
    cib new ip
    # Define the IPs associated with the backup system, and group them together.
    configure primitive AssistantIP ocf:heartbeat:IPaddr2 params ip=129.236.252.10 cidr_netmask=32 op monitor interval=30s
    configure primitive AssistantLocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.10 cidr_netmask=32 op monitor interval=30s
    configure group AssistantIPGroup AssistantIP AssistantLocalIP
    # Define a "colocation" = how much do you want these things together?
    # A score of -1000 means to try to keep them on separate machines as
    # much as possible, but allow them on the same machine if necessary.
    configure colocation SeparateIPs -1000: MainIPGroup AssistantIPGroup
    # I like these commands, so commit them to the running configuration.
    cib commit ip
    quit
# DRBD is a service that synchronizes the hard drives between two machines.
# For our cluster, one machine will have access to the "master" copy

Line: 137 to 115

# "slave" copy and mindlessly duplicate all the changes.
crm

Added:
> > # Define a "shadow" configuration, to test things without committing them
    # to the HA cluster:

cib new drbd

# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/
Line: 154 to 134

# The machine that gets the master copy (the one that will make changes to the drive)
# should also be the one with the main IP address.

Changed:
< < configure colocation AdminWithMainIP inf: Admin:Master MainIPGroup
> > configure colocation AdminWithMainIP inf: MainIPGroup Admin:Master

    # I like these commands, so commit them to the running configuration.

cib commit drbd
Line: 166 to 148

configure master Work WorkDrbd meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false

Changed:
< < # I prefer the work directory to be on the main admin box, but it doesn't have to be.
> > # I prefer the work directory to be on the main admin box, but it doesn't have to be. "500:" is
    # a weighting factor; compare it to "inf:" (for infinity) which is used in most of these commands.

configure colocation WorkPrefersMain 500: Work:Master MainIPGroup
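If you ever want to see how a finite score like "500:" plays out against the "inf:" constraints when Pacemaker places resources, the allocation scores can be dumped with a standard Pacemaker utility. This isn't part of the original notes, and on older installations the equivalent tool was called ptest:

   crm_simulate -sL | grep -i work

The -s flag shows the allocation scores and -L reads the live cluster state.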
Line: 358 to 341

configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory

Changed:
< < # Once that directory has been set up, we can start NFS.
> > # Now that the NFS state directory is mounted, we can start the nfslockd. Note that
    # that we're starting NFS lock on both the primary and secondary HA systems;
    # by default a "clone" resource is started on all systems in a cluster.
    configure primitive NfsLockInstance lsb:nfslock
    clone NfsLock NfsLockInstance
    # Once nfslockd has been set up, we can start NFS.

configure primitive Nfs lsb:nfs
configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory
Line: 504 to 494

configure order DirectoresBeforeLibrary inf: AdminDirectoriesGroup LibraryOnWork

Added:
> > # Define the IPs associated with the backup system, and group them together.
    # This is a non-critical definition, and I don't want to assign it until the more important
    # "secondary" resources have been set up.
    configure primitive Burr ocf:heartbeat:IPaddr2 params ip=129.236.252.10 cidr_netmask=32 op monitor interval=30s
    configure primitive BurrLocal ocf:heartbeat:IPaddr2 params ip=10.44.7.10 cidr_netmask=32 op monitor interval=30s
    configure group AssistantIPGroup Burr BurrLocal
    colocation AssistantWithLibrary inf: AssistantIPGroup LibraryOnWork
    order LibraryBeforeAssistant inf: LibraryOnWork AssistantIPGroup

# The standard condor execution service. As with all the batch nodes,
# I've already configured /etc/condor/condor_config.local and created
# scratch directories in /data/condor.
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

Line: 22 to 22

crm configure show
Changed:
< < To get a constantly-updated display of the configuration, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):
> > To get a constantly-updated display of the resource status, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):

crm_mon

Line: 57 to 57

# ... timeout = how long to wait before you assume a resource is dead.

Added:
> > How to find out which scripts exist, that is, which resources can be controlled by the HA cluster:

       ra classes

    Based on the result, I looked at:

       ra list ocf heartbeat

    To find out what IPaddr2 parameters I needed, I used:

       ra meta ocf:heartbeat:IPaddr2

Configuration

This work was done in Sep-2010. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.
Added:
> > # The commands ultimately used to configure the high-availability (HA) servers:

# The beginning: make sure corosync is running on both hypatia and orestes:
/sbin/service corosync start

Line: 116 to 131

cib commit ip
quit
Changed:
< < # DRBD is a service that syncronizes the hard drives between two machines.
> > # DRBD is a service that synchronizes the hard drives between two machines.

# For our cluster, one machine will have access to the "master" copy
# and make all the changes to that copy; the other machine will have the
# "slave" copy and mindlessly duplicate all the changes.
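To confirm that the two copies really are in sync, independent of what Pacemaker reports, the usual DRBD command-line tools can be run on either node. These commands are an addition to the original notes:

   cat /proc/drbd        # connection state and sync progress for all resources
   drbdadm role admin    # prints Primary/Secondary for the 'admin' resource
   drbdadm dstate admin  # prints UpToDate/UpToDate when fully synchronized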
Line: 162 to 177

cib commit drbd
quit

Changed:
< < # Now try a resource that depends on ordering: On the node that's has the master
    # resource for "work," mount that disk image on as /work.
> > # Now try a resource that depends on ordering: On the node that has the master
    # resource for "work," mount that disk image as /work.

crm
cib new workdisk

Line: 283 to 298

configure primitive Cups lsb:cups
Changed:
< < # The print server must be associated with the main IP address.
    # A score of "inf" means "infinity"; if it can't be run on the
    # machine that's offering the main IP address, it won't run at all.
    configure colocation CupsWithMainIP inf: Cups MainIPGroup
    # But that's not the only requirement. Cups stores its spool files in
    # /var/spool/cups. If the cups service were to switch to a different server,
    # we want the new server to see the spools files. So create /var/nevis/cups,
    # link it with:
> > # Cups stores its spool files in /var/spool/cups. If the cups service
    # were to switch to a different server, we want the new server to see
    # the spooled files. So create /var/nevis/cups, link it with:

# mv /var/spool/cups /var/spool/cups.ori
# ln -sf /var/nevis/cups /var/spool/cups
# and demand that the cups service only start if /var/nevis (and the other
# high-availability directories) have been mounted.

Added:
> > # A score of "inf" means "infinity"; if it can't be run on the
    # machine that mounted all the admin directories, it won't run at all.

configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup

# In order to prevent chaos, make sure that the high-availability directories

Line: 327 to 338

cib commit services
quit
Changed:
< < # The high-availability servers export the /usr/nevis directory to all the
    # other machines on the Nevis Linux cluster. NFS exporting of a shared
    # directory can be a little tricky. As with CUPS spooling, we want to preserve
    # the NFS export state in a way that the backup server can pick it up.
    # The safest way to do this is to create a small separate LVM partition
    # ("nfs") and mount it as "/var/lib/nfs".
> > # The high-availability servers export some of the admin directories to other
    # systems, both real and virtual; for example, the /usr/nevis directory is
    # exported to all the other machines on the Nevis Linux cluster.
    # NFS exporting of a shared directory can be a little tricky. As with CUPS
    # spooling, we want to preserve the NFS export state in a way that the
    # backup server can pick it up. The safest way to do this is to create a
    # small separate LVM partition ("nfs") and mount it as "/var/lib/nfs",
    # the NFS directory that contains files that keep track of the NFS state.

crm
cib new nfs
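The one-time preparation of that small "nfs" partition isn't shown anywhere on this page. A rough sketch of how it would be done, using the "admin" volume group and /dev/admin/nfs device that appear in the configuration; the volume size is a guess:

   lvcreate -L 1G -n nfs admin      # size is an assumption
   mkfs.ext4 /dev/admin/nfs
   mount /dev/admin/nfs /mnt
   cp -a /var/lib/nfs/. /mnt/
   umount /mnt

After that, Pacemaker's NfsStateDirectory resource takes over mounting /dev/admin/nfs on /var/lib/nfs.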
Line: 354 to 368

quit

Changed:
< < # The whole point of this is to be able to run guest virtual machines under the
    # control of the high-availability service. Here is the set-up for one example
> > # The whole point of the entire setup is to be able to run guest virtual machines
    # under the control of the high-availability service. Here is the set-up for one example

# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.
crm
cib new hogwarts

Added:
> > # Give the virtual machine a long stop interval before flagging an error.
    # Sometimes it takes a while for Linux to shut down.
    configure primitive Hogwarts ocf:heartbeat:Filesystem params
       xmfile="/xen/configs/Hogwarts.cfg" op stop interval="0" timeout="240"
# All the virtual machine files are stored in the /xen partition, which is one

Changed:
< < # of the high-availability admin directories. Make sure the directory is mounted
    # before starting the virtual machine.
> > # of the high-availability admin directories. The virtual machine must run on
    # the system with this directory.

Deleted:
< < configure primitive Hogwarts ocf:heartbeat:Filesystem params xmfile="/xen/configs/Hogwarts.cfg"

configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup

Changed:
< < configure order DirectoriesBeforeHogwarts inf: AdminDirectoriesGroup Hogwarts
> > # All of the virtual machines depend on NFS-mounting directories which
    # are exported by the HA server. The safest thing to do is to make sure
    # NFS is running on the HA server before starting the virtual machine.
    configure order NfsBeforeHogwarts inf: Nfs Hogwarts

cib commit hogwarts
quit
Line: 384 to 409

# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.

Changed:
< < # This also known as "fencing": once a node fails, it can't be allowed to re-join the
    # cluster.
> > # This is a special case of "fencing": once a node or resource fails, it can't be allowed
    # to start up again automatically.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut-off the power on the UPS connected to the failed node.
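For the curious: NUT exposes the UPS outlet power as an "instant command", so the custom nut.sh STONITH plugin presumably ends up issuing something along these lines, using the UPS names and credentials that appear later on this page (these exact commands are an illustration, not a quote from the plugin):

   upscmd -l hypatia-ups                         # list the instant commands this UPS supports
   upscmd -u admin -p acdc hypatia-ups load.off  # cut power to the load, i.e. to hypatia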
Line: 475 to 500

configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup

# Determine on which machine we mount library:/usr/nevis after we

Changed:
< < # figure out which machine is running AdminDirectoriesGroup. "symmetrical=false"
    # means that if we're turning off the resource for some reason, we don't
    # have to wait for LibraryOnWork to be stopped before we try to stop
    # AdminDirectoriesGroup (since these resources always run on different machines).
> > # figure out which machine is running AdminDirectoriesGroup.

Changed:
< < configure order DirectoresBeforeLibrary inf: AdminDirectoriesGroup LibraryOnWork symmetrical=false
> > configure order DirectoresBeforeLibrary inf: AdminDirectoriesGroup LibraryOnWork

# The standard condor execution service. As with all the batch nodes,
# I've already configured /etc/condor/condor_config.local and created
Line: 1 to 1

Nevis particle-physics administrative cluster configuration

This is a reference page. It contains a text file that describes how the high-availability Pacemaker/Corosync cluster is configured on hypatia and orestes.
Changed:
< < This may help as you work your way through the configuration:
> > Files

    Key HA configuration files. Note: Even in an emergency, there's no reason to edit these files:

       /etc/drbd.conf
       /etc/drbd.d/*.res
       /etc/lvm/lvm.conf
       /etc/corosync/corosync.conf
       /home/bin/nut.sh
       /home/bin/rsync-config.sh   # Daily rsync from hypatia to orestes

    Commands
Added:
> > The configuration has definitely changed from that listed below. To see the current configuration, run this as root on either hypatia or orestes:

       crm configure show

    To get a constantly-updated display of the configuration, the following command is the corosync equivalent of "top" (use Ctrl-C to exit):

       crm_mon

    For a GUI, you can use this utility. You have to select "Connect" and login via an account that's a member of the haclient group; you may have to edit /etc/group.
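If you just want a one-shot snapshot of the same status information, e.g. over a slow connection or from a script, crm_mon can also run non-interactively. This isn't mentioned in the original notes, but it's a standard crm_mon option:

   crm_mon -1      # print the cluster status once and exit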
Changed:
< < # Concepts:
    crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3
> > crm_gui &

    You can run the above commands via sudo, but you'll have to extend your path; e.g.,

       export PATH=/sbin:/usr/sbin:${PATH}
       sudo crm_mon

    Concepts

    This may help as you work your way through the configuration:

    crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
      cidr_netmask=32 op monitor interval=30s

# Which is composed of
Changed:
< < * crm ::= "corosync resource manager", the command we're executing
> > * crm ::= "cluster resource manager", the command we're executing

* primitive ::= The type of resource object that we’re creating.
* IP ::= Our name for the resource
* IPaddr2 ::= The script to call

Line: 25 to 57

# ... timeout = how long to wait before you assume a resource is dead.
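As an illustration of where such a timeout would actually appear, following the same pattern as the IP example above (the values here are hypothetical):

   crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
         cidr_netmask=32 op monitor interval=30s timeout=20s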
Added:
> > Configuration

This work was done in Sep-2010. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.

Line: 1 to 1

Added:
> >
Nevis particle-physics administrative cluster configuration

This is a reference page. It contains a text file that describes how the high-availability Pacemaker/Corosync cluster is configured on hypatia and orestes.
This may help as you work your way through the configuration:
# Concepts:

crm configure primitive IP ocf:heartbeat:IPaddr2 params ip=192.168.85.3 \
      cidr_netmask=32 op monitor interval=30s

# Which is composed of
   * crm ::= "corosync resource manager", the command we're executing
   * primitive ::= The type of resource object that we’re creating.
   * IP ::= Our name for the resource
   * IPaddr2 ::= The script to call
   * ocf ::= The standard it conforms to
   * ip=192.168.85.3 ::= Parameter(s) as name/value pairs
   * cidr_netmask ::= netmask; 32-bits means use this exact IP address
   * op ::= what follows are options
   * monitor interval=30s ::= check every 30 seconds that this resource is working
# ... timeout = how long to wait before you assume a resource is dead.

This work was done in Sep-2010. The configuration has almost certainly changed since then. Hopefully, the following commands and comments will guide you to understanding any future changes and the reasons for them.

# The beginning: make sure corosync is running on both hypatia and orestes:
/sbin/service corosync start

# The following line is needed because we have only two machines in
# the HA cluster.
crm configure property no-quorum-policy=ignore

# We'll configure STONITH later (see below)
crm configure property stonith-enabled=false

# Define IP addresses to be managed by the HA systems.
crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=129.236.252.11 \
      cidr_netmask=32 op monitor interval=30s
crm configure primitive LocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.11 \
      cidr_netmask=32 op monitor interval=30s
crm configure primitive SandboxIP ocf:heartbeat:IPaddr2 params ip=10.43.7.11 \
      cidr_netmask=32 op monitor interval=30s

# Group these together, so they'll all be assigned to the same machine.
# The name of the group is "MainIPGroup".
crm configure group MainIPGroup ClusterIP LocalIP SandboxIP

# Let's continue by entering the crm utility for short sessions. I'm going to
# test groups of commands before I commit them. (I omit the "configure show"
# and "status" commands that I frequently typed in, in order to see that
# everything was correct.)
crm

# Define a "shadow" configuration, to test things without commiting them
# to the HA cluster:
cib new ip

# Define the IPs associated with the backup system, and group them together.
configure primitive AssistantIP ocf:heartbeat:IPaddr2 params ip=129.236.252.10 \
      cidr_netmask=32 op monitor interval=30s
configure primitive AssistantLocalIP ocf:heartbeat:IPaddr2 params ip=10.44.7.10 cidr_netmask=32 op monitor interval=30s
configure group AssistantIPGroup AssistantIP AssistantLocalIP

# Define a "colocation" = how much do you want these things together?
# A score of -1000 means to try to keep them on separate machines as
# much as possible, but allow them on the same machine if necessary.
configure colocation SeparateIPs -1000: MainIPGroup AssistantIPGroup

# I like these commands, so commit them to the running configuration.
cib commit ip
quit

# DRBD is a service that syncronizes the hard drives between two machines.
# For our cluster, one machine will have access to the "master" copy
# and make all the changes to that copy; the other machine will have the
# "slave" copy and mindlessly duplicate all the changes.
crm
cib new drbd

# The "drbd_resource" parameter points to a configuration defined in /etc/drbd.d/
configure primitive AdminDrbd ocf:linbit:drbd params drbd_resource=admin op monitor interval=60s

# DRBD functions with a "master/slave" setup as described above. The following command
# defines the name of the master disk partition ("Admin"). The remaining parameters
# clarify that there are two copies, but only one can be the master, and
# at most one can be a slave.
configure master Admin AdminDrbd meta master-max=1 master-node-max=1 \
      clone-max=2 clone-node-max=1 notify=true globally-unique=false

# The machine that gets the master copy (the one that will make changes to the drive)
# should also be the one with the main IP address.
configure colocation AdminWithMainIP inf: Admin:Master MainIPGroup
cib commit drbd

# Things look good, so let's add another disk resource. I defined another drbd resource
# with some spare disk space, called "work". The idea is that I can play with alternate
# virtual machines and save them on "work" before I copy them to the more robust "admin".
configure primitive WorkDrbd ocf:linbit:drbd params drbd_resource=work op monitor interval=60s
configure master Work WorkDrbd meta master-max=1 master-node-max=1 \
      clone-max=2 clone-node-max=1 notify=true globally-unique=false

# I prefer the work directory to be on the main admin box, but it doesn't have to be.
configure colocation WorkPrefersMain 500: Work:Master MainIPGroup

# Given a choice, try to put the Admin:Master on hypatia
configure location DefinePreferredMainNode Admin 100: hypatia.nevis.columbia.edu
cib commit drbd
quit

# Now try a resource that depends on ordering: On the node that's has the master
# resource for "work," mount that disk image on as /work.
crm
cib new workdisk

# To find out that there was an "ocf:heartbeat:Filesystem" that I could use,
# I used the command:
ra classes
# Based on the result, I looked at:
ra list ocf heartbeat
# To find out what Filesystem parameters I needed, I used:
ra meta ocf:heartbeat:Filesystem

# All of the above led me to create the following resource configuration:
configure primitive WorkDirectory ocf:heartbeat:Filesystem \
      params device="/dev/drbd2" directory="/work" fstype="ext4"
# Note that I had previously created an ext4 filesystem on /dev/drbd2.

# Now specify that we want this to be on the same node as Work:Master:
configure colocation DirectoryWithWork inf: WorkDirectory Work:Master

# One more thing: It's important that we not try to mount the directory
# until after Work has been promoted to master on the node.
configure order WorkBeforeDirectory inf: Work:promote WorkDirectory:start
cib commit workdisk
quit

# We've made the relatively-unimportant work DRBD master function. Let's do it for real.
# Prevously I created some LVM volumes on the admin DRBD master. We need to use a
# resource to activate them, but we can't activate them until after the Admin:Master
# is loaded.
crm
cib new lvm

# Activate the LVM volumes, but only after DRBD has figured out where
# Admin:Master is located.
configure primitive Lvm ocf:heartbeat:LVM \
      params volgrpname="admin"
configure colocation LvmWithAdmin inf: Lvm Admin:Master
configure order AdminBeforeLvm inf: Admin:promote Lvm:start
cib commit lvm

# Go back to the actual, live configuration
cib use live
# See if everything is working
configure show
status
# Go back to the shadow for more commands.
cib use lvm

# We have a whole bunch of filesystems on the "admin" volume group. Let's
# create the commands to mount them.
# The 'timeout="240s"' piece is to give a four-minute interval to start
# up the mount. This allows for a "it's been too long, do an fsck" check
# on mounting the filesystem.
# We also allow five minutes for the unmounting to stop, just in case
# it's taking a while for some job on the server to let go of the mount.
# It's better that it take a while to switch over the system service
# than for the mount to be forcibly terminated.
configure primitive UsrDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/usr" directory="/usr/nevis" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
configure primitive VarDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/var" directory="/var/nevis" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
configure primitive MailDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/mail" directory="/mail" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
configure primitive XenDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/xen" directory="/xen" fstype="ext4" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="300s"
configure group AdminDirectoriesGroup UsrDirectory VarDirectory MailDirectory XenDirectory

# We can't mount any of them until LVM is set up:
configure colocation DirectoriesWithLVM inf: AdminDirectoriesGroup Lvm
configure order LvmBeforeDirectories inf: Lvm AdminDirectoriesGroup
cib commit lvm
quit

# Some standard Linux services are under corosync's control. They depend on some or
# all of the filesystems being mounted.
# Let's start with a simple one: enable the printing service (cups):
crm
cib new printing

# lsb = "Linux Standard Base." It just means any service which is
# controlled by one of the standard scripts in /etc/init.d
configure primitive Cups lsb:cups

# The print server _must_ be associated with the main IP address.
# A score of "inf" means "infinity"; if it can't be run on the
# machine that's offering the main IP address, it won't run at all.
configure colocation CupsWithMainIP inf: Cups MainIPGroup

# But that's not the only requirement. Cups stores its spool files in
# /var/spool/cups. If the cups service were to switch to a different server,
# we want the new server to see the spools files. So create /var/nevis/cups,
# link it with:
# mv /var/spool/cups /var/spool/cups.ori
# ln -sf /var/nevis/cups /var/spool/cups
# and demand that the cups service only start if /var/nevis (and the other
# high-availability directories) have been mounted.
configure colocation CupsWithVar inf: Cups AdminDirectoriesGroup

# In order to prevent chaos, make sure that the high-availability directories
# have been mounted before we try to start cups.
configure order VarBeforeCups inf: AdminDirectoriesGroup Cups
cib commit printing
quit

# The other services (xinetd, dhcpd) follow the same pattern as above:
# Make sure the services start on the same machine as the admin directories,
# and after the admin directories are successfully mounted.
crm
cib new services
configure primitive Xinetd lsb:xinetd
configure primitive Dhcpd lsb:dhcpd
configure colocation XinetdWithVar inf: Xinetd AdminDirectoriesGroup
configure order VarBeforeXinetd inf: VarDirectory Xinetd
configure colocation DhcpdWithVar inf: Dhcpd AdminDirectoriesGroup
configure order VarBeforeDhcpd inf: VarDirectory Dhcpd
cib commit services
quit

# The high-availability servers export the /usr/nevis directory to all the
# other machines on the Nevis Linux cluster. NFS exporting of a shared
# directory can be a little tricky. As with CUPS spooling, we want to preserve
# the NFS export state in a way that the backup server can pick it up.
# The safest way to do this is to create a small separate LVM partition
# ("nfs") and mount it as "/var/lib/nfs".
crm
cib new nfs

# Define the mount for the NFS state directory /var/lib/nfs
configure primitive NfsStateDirectory ocf:heartbeat:Filesystem \
      params device="/dev/admin/nfs" directory="/var/lib/nfs" fstype="ext4"
configure colocation NfsStateWithVar inf: NfsStateDirectory AdminDirectoriesGroup
configure order VarBeforeNfsState inf: AdminDirectoriesGroup NfsStateDirectory

# Once that directory has been set up, we can start NFS.
configure primitive Nfs lsb:nfs
configure colocation NfsWithNfsState inf: Nfs NfsStateDirectory
configure order NfsStateBeforeNfs inf: NfsStateDirectory Nfs
cib commit nfs
quit

# The whole point of this is to be able to run guest virtual machines under the
# control of the high-availability service. Here is the set-up for one example
# virtual machine. I previously created the hogwarts virtual machine and copied its
# configuration to /xen/configs/hogwarts.cfg.
crm
cib new hogwarts

# All the virtual machine files are stored in the /xen partition, which is one
# of the high-availability admin directories. Make sure the directory is mounted
# before starting the virtual machine.
configure primitive Hogwarts ocf:heartbeat:Filesystem params xmfile="/xen/configs/Hogwarts.cfg"
configure colocation HogwartsWithDirectories inf: Hogwarts AdminDirectoriesGroup
configure order DirectoriesBeforeHogwarts inf: AdminDirectoriesGroup Hogwarts
cib commit hogwarts
quit

# An important part of a high-availability configuration is STONITH = "Shoot the
# other node in the head." Here's the idea: suppose one node fails for some reason. The
# other node will take over as needed.
# Suppose the failed node tries to come up again. This can be a problem: The other node
# may have accumulated changes that the failed node doesn't know about. There can be
# synchronization issues that require manual intervention.
# The STONITH mechanism means: If a node fails, the remaining node(s) in a cluster will
# force a permanent shutdown of the failed node; it can't automatically come back up again.
# This also known as "fencing": once a node fails, it can't be allowed to re-join the
# cluster.

# In general, there are many ways to implement a STONITH mechanism. At Nevis, the way
# we do it is to shut-off the power on the UPS connected to the failed node.
# (By the way, this is why you have to restart hypatia and orestes at the same time.
# If you just restart one, the STONITH mechanism will cause the UPS on the restarting
# computer to turn off the power; it will never come back up.)

# At Nevis, the UPSes are monitored and controlled using the NUT package
# <http://www.networkupstools.org/>; details are on the Nevis wiki at
# <http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Ups>.

# The official corosync distribution from <http://www.clusterlabs.org/>
# does not include a script for NUT, so I had to write one. It's located at
# /home/bin/nut.sh on both hypatia and orestes; there are appropriate links
# to this script from the stonith/external directory.
# By the way, I sent the script to Cluster Labs, who accepted it.
# The next generation of their distribution will include the script.

# The following commands implement the STONITH mechanism for our cluster:
crm
cib new stonith

# The STONITH resource that can potentially shut down hypatia.
configure primitive HypatiaStonith stonith:external/nut \
      params hostname="hypatia.nevis.columbia.edu" \
      ups="hypatia-ups" username="admin" password="acdc"

# The node that runs the above script cannot be hypatia; it's
# not wise to trust a node to STONITH itself. Note that the score
# is "negative infinity," which means "never run this resource
# on the named node."
configure location HypatiaStonithLoc HypatiaStonith -inf: hypatia.nevis.columbia.edu

# The STONITH resource that can potentially shut down orestes.
configure primitive OrestesStonith stonith:external/nut \
      params hostname="orestes.nevis.columbia.edu" \
      ups="orestes-ups" username="admin" password="acdc"

# Again, orestes cannot be the node that runs the above script.
configure location OresetesStonithLoc OrestesStonith -inf: orestes.nevis.columbia.edu
cib commit stonith
quit

# Now turn the STONITH mechanism on for the cluster.
crm configure property stonith-enabled=true

# At this point, the key elements of the high-availability configuration have
# been set up. There is one non-critical frill: One node (probably hypatia) will be
# running the important services, while the other node (probably orestes) would
# be "twiddling its thumbs." Instead, let's have orestes do something useful: execute
# condor jobs.
# For orestes to do this, it requires the condor service. It also requires that
# library:/usr/nevis is mounted, the same as every other batch machine on the
# Nevis condor cluster. We can't use the automount daemon (amd) to do this for
# us, the way we do on the other batch nodes, so we have to make corosync do the
# mounts.
crm
cib new condor

# Mount library:/usr/nevis
configure primitive LibraryOnWork ocf:heartbeat:Filesystem \
      params device="library:/usr/nevis" directory="/usr/nevis" \
      fstype="nfs" OCF_CHECK_LEVEL="20"

# Corosync must NOT mount library:/usr/nevis on the system that has already
# mounted /usr/nevis directly as part of AdminDirectoriesGroup
# described above.
# Note that if there's only one node remaining in the high-availability
# cluster, it will be running the resource AdminDirectoriesGroup, and
# LibraryOnWork will never be started. This is fine; if there's only one
# node left, I _don't_ want it running condor jobs.
configure colocation NoRemoteMountWithDirectories -inf: LibraryOnWork AdminDirectoriesGroup

# Determine on which machine we mount library:/usr/nevis _after_ we
# figure out which machine is running AdminDirectoriesGroup. "symmetrical=false"
# means that if we're turning off the resource for some reason, we don't
# have to wait for LibraryOnWork to be stopped before we try to stop
# AdminDirectoriesGroup (since these resources always run on different machines).
configure order DirectoresBeforeLibrary inf: AdminDirectoriesGroup LibraryOnWork \
      symmetrical=false

# The standard condor execution service. As with all the batch nodes,
# I've already configured /etc/condor/condor_config.local and created
# scratch directories in /data/condor.
configure primitive Condor lsb:condor

# If we're able to mount library:/usr/nevis, then it's safe to start condor.
# If we can't mount library:/usr/nevis, then condor will never be started.
configure colocation CondorWithLibrary inf: Condor LibraryOnWork

# library:/usr/nevis must be mounted before condor starts.
configure order LibraryBeforeCondor inf: LibraryOnWork Condor
cib commit condor
quit
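A final sanity check that isn't part of the original notes: after committing a batch of changes like the ones above, the live configuration can be checked for constraint errors and warnings with a standard Pacemaker tool:

   crm_verify -L -V      # validate the live CIB and print any warnings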