---+!! Nevis particle-physics administrative cluster

<div style="float:right; background-color:#EBEEF0; margin:0 0 20px 20px; padding: 0 10px 0 10px;"> %TOC{title="On this page:"}% </div>

This is a description of the organization of the administrative computers on the Nevis [[Linux cluster]]. The emphasis is on describing the [[http://en.wikipedia.org/wiki/High-availability_cluster][high-availability]] cluster, because such clusters are relatively new to the world of physics.

---++ Background

---+++ A single system

In the 1990s, Nevis computing centered on a single computer, =nevis1=. Most users relied on this machine to analyze data, access their e-mail, set up web sites, etc. Although the system (an SGI Challenge XL) was relatively powerful for its time, this organization had some disadvantages:

   * All the users had to share the processing queues. One user could dominate the computer, preventing anyone else from running their own analysis jobs.
   * If it became necessary to restart the computer, the restart had to be scheduled in advance (typically two weeks), since it would affect almost everyone at Nevis.
   * If the system's security was compromised, it affected everyone and everything on the system.

---+++ A distributed cluster

In the 2000s, =nevis1= was gradually replaced by [[ListOfMachines][many Linux boxes]]. Administrative services were moved to separate systems, typically one service per box; e.g., there was a [[mail]] server, a [[http://en.wikipedia.org/wiki/Domain_Name_System][DNS]] server, a [[http://en.wikipedia.org/wiki/Samba_%28software%29][Samba]] server, etc. Each working group at Nevis purchased its own server and managed its own disk space. The above issues were resolved:

   * Jobs could be sent to the [[condor]] batch system, so that no one system would be slowed down by a single user's jobs.
   * A single computer system could be restarted without affecting most of the rest of the cluster; e.g., if the mail server needed to be rebooted, it didn't affect the physics analysis.
   * If a server became compromised, e.g., the web server, the effects could be restricted to that server.

This configuration worked well for a few years, but some issues arose over time:

   * Each working group would purchase and maintain its server with its own funds. However, the administrative servers had to be purchased with Nevis' infrastructure funds. That meant that the administrative servers would be replaced rarely, if at all.
   * As a result, the administrative servers tended to be older, recycled, or inexpensive systems, with a correspondingly higher risk of failure.
   * At one point there were seven administrative servers in our computer area, each one requiring an [[ups][uninterruptible power supply]], and each one contributing to the power and heat load of the computer room.

---++ High-availability

---+++ Tools

Over the last few years, there has been substantial work in the open-source community on high-availability servers. To put it simply, a service can be offered by a machine on a high-availability (HA) cluster. If that machine fails, the service automatically transfers to another machine on the HA cluster. The software packages used to implement HA on our cluster are [[http://www.clusterlabs.org/][Corosync with Pacemaker]].

Another open-source development was virtual machines for Linux. If you've ever used [[http://www.vmware.com/products/workstation/][VMware]], you're already familiar with the concept. The software (actually, a kernel extension) used to implement virtual machines in Linux is called [[http://www.xen.org/][Xen]].

The final "piece of the puzzle" is a software package plus kernel extension called [[http://www.drbd.org/][DRBD]]. The simplest way to understand DRBD is to think of it as [[http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_1][RAID1]] between two computers: when one computer makes a change to a disk partition, the other computer makes the identical change to its copy of the partition. In this way, the disk partition is mirrored between the two systems.
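For a flavor of what this looks like in practice, here is a minimal sketch of a DRBD resource definition. It is not our actual configuration file: the backing device =/dev/md2= and the private replication-link addresses are illustrative assumptions; only the hostnames, the resource name "admin", and the device =/dev/drbd1= correspond to the setup described later on this page.

<verbatim>
# Sketch of a DRBD resource definition (e.g. /etc/drbd.d/admin.res).
# The backing disk /dev/md2 and the replication-link addresses are
# made-up examples, not the values actually used on hypatia/orestes.
resource admin {
  protocol C;                     # synchronous: a write completes only after
                                  # both systems have written it to disk
  on hypatia {
    device    /dev/drbd1;         # the mirrored device the system sees
    disk      /dev/md2;           # local RAID1 array backing the mirror
    address   192.168.100.1:7789;
    meta-disk internal;
  }
  on orestes {
    device    /dev/drbd1;
    disk      /dev/md2;
    address   192.168.100.2:7789;
    meta-disk internal;
  }
}
</verbatim>

With a definition like this on both machines, =drbdadm up admin= brings up the mirror, and whichever node is made primary (=drbdadm primary admin=) sees =/dev/drbd1= as an ordinary block device.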
These tools can be used to solve the issues listed above:

   * Six different computers, all old, could be replaced by two new systems.
   * The systems could be new, but relatively inexpensive. If one of them failed, the other one would automatically take up the services that the first computer offered.
   * The disk images of the two computers, including the virtual machines, could be kept automatically synchronized (as opposed to being copied via a script run at regular intervals).
   * Virtual machines are essentially large (~10GB) disk files. They can be manipulated as if they were separate computers (e.g., rebooted when needed) but can also be copied like disk files:
      * If a virtual server had its security broken, it could be quickly replaced by an older, uncompromised copy of that virtual machine.
      * If a virtual server required a complicated, time-consuming upgrade, a copy could be upgraded instead, then quickly swapped in with minimal interruption.

The price to be paid for all this sophistication is an increase in complexity. This page (and its companion page on the detailed [[Corosync configuration]]) is an attempt to explain the details.

---+++ Configuration

---++++ The non-HA server

First, let's go over an administrative server that is not part of the HA cluster: =hermes=. This server provides the following functions:

   * [[http://en.wikipedia.org/wiki/Domain_Name_System][DNS]] master server;
   * [[http://en.wikipedia.org/wiki/Network_Information_Service][NIS]] slave server;
   * [[condor]] batch manager;
   * [[http://en.wikipedia.org/wiki/Syslog][log]] server.

These services are not part of the HA cluster because:

   * in the hopefully unlikely event that the HA cluster needs to be rebooted, it's nice to have the services on =hermes= still available during the reboot;
   * it takes a bit of time for the HA cluster to start up after, e.g., a [[ups][power outage]]; again, it's nice to have the above services immediately available.

---++++ The HA cluster

The two high-availability servers are =hypatia= and =orestes=. For the sake of simplicity, =hypatia= is normally the "main" server and =orestes= the backup server.

---+++++ Disk configuration

A sketch of the disk organization of the high-availability servers:

<br /> <img src="%ATTACHURLPATH%/HAServersSketch.jpg" alt="HAServersSketch.jpg" width='514' height='501' /> <br />

Text description (a command-level sketch of this storage stack follows the list):

   * Each system has two 200 GB drives, grouped together in a RAID1. These drives contain the operating system and any scratch disk space.
   * Each system has one 1 TB drive and one 1.5 TB drive.
   * Main resource partition (admin):
      * 1 TB of disk space from each of these drives is grouped in a RAID1.
      * These two RAID1 areas are in turn grouped between the systems via DRBD into a single mirrored partition, =/dev/drbd1=.
      * Using [[http://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29][LVM]], this partition is carved into directories which are exported to other machines, both real and virtual:
         * =/usr/nevis= is the application library partition exported to every other system on the [[Linux cluster]].
         * =/var/nevis= contains several resource directories; e.g., =/var/nevis/www= contains the Apache web server files; =/var/nevis/dhcpd= contains the [[http://en.wikipedia.org/wiki/Dhcpd][DHCP]] work files.
         * =/var/lib/nfs= contains the files that keep track of the [[http://en.wikipedia.org/wiki/Network_File_System_%28protocol%29][NFS]] state associated with exporting these directories to other systems.
         * =/mail= contains the [[Mail-relatedFiles][work files]] used by the [[mail]] server, including users' INBOXes and [[IMAPMailFiles][IMAP]] folders.
         * =/xen= contains the disk images of the [[http://en.wikipedia.org/wiki/Xen][Xen]] virtual machines.
   * Scratch resource partition (work):
      * The excess 0.5 TB is grouped between the systems into a single mirrored partition, =/work=.
      * This area is used for developing new virtual machines, and for making "snapshots" of existing virtual machines.
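To make the layering (RAID1, then DRBD, then LVM) more concrete, here is a sketch of the commands that build such a stack. The device names, volume-group name, logical-volume names, and sizes are illustrative assumptions, not the exact values used on =hypatia= and =orestes=; only the mount points come from the description above.

<verbatim>
# 1. Pair the 1 TB partitions from the two local data drives into a RAID1:
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

# 2. /dev/md2 then becomes the backing "disk" of the DRBD "admin" resource
#    (see the DRBD sketch earlier), which appears as /dev/drbd1 on both nodes.

# 3. On whichever node holds the DRBD primary role, carve /dev/drbd1 up with LVM:
pvcreate /dev/drbd1
vgcreate admin /dev/drbd1
lvcreate --name usrnevis --size 200G admin   # mounted as /usr/nevis
lvcreate --name varnevis --size 100G admin   # mounted as /var/nevis
lvcreate --name mail     --size 200G admin   # mounted as /mail
lvcreate --name xen      --size 400G admin   # holds the Xen disk images (/xen)
# ... and similarly for /var/lib/nfs.
</verbatim>

Because the physical volume sits on top of =/dev/drbd1=, the logical volumes are only usable on the node that currently holds the DRBD primary role; that is what the "LVM" resource in the outline further down takes care of.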
---+++++ Network configuration

Both =hypatia= and =orestes= have two [[http://en.wikipedia.org/wiki/Ethernet][Ethernet]] ports:

   * Port =eth0= is used for all external network traffic; [[http://en.wikipedia.org/wiki/Vlan][VLANs]] are used in order for this port to accept traffic from any of the Nevis [[networks]].
   * Port =eth1= is used for the internal HA cluster traffic; the two systems are connected directly to each other using a single Ethernet cable. Both [[http://www.drbd.org/][DRBD]] and [[http://www.clusterlabs.org/][Corosync]] have been configured to use this port exclusively.

One might access the systems via their fixed names of =hypatia= and =orestes=, but this would not be useful if the HA services were moved from one system to the other. Among the HA resources (see below) that are managed by the systems are "generic" IP addresses assigned to the cluster. The name =hamilton.nevis.columbia.edu= always points to the system that is offering the important cluster resources; the name =burr.nevis.columbia.edu= always points to the system offering "scratch" resources. Of course, if one of these systems goes down, then these two aliases will point to the same box. In general, this means that if you need to access the system offering the main cluster resources, always use the name =hamilton=.

---+++++ Resource configuration

In HA terms, a "resource" means "anything you want to keep available all the time." What follows is an outline of the resources configured for our HA cluster. In this outline, an indent means that the resource depends on the one above it; for example, the mail-server virtual machine won't start if NFS is not available, and NFS won't start if =/var/lib/nfs= is not available. The entire configuration is spelled out in (excruciating?) detail on a separate [[Corosync configuration]] page.

<verbatim>
Services controlled by corosync:

main node:

   Admin:Master = the DRBD "admin" partition's main image
      (Constraint: +100 to be on hypatia)

      MainIPGroup:
         IP = 129.236.252.11 (hamilton = library = time = print)
         IP = 10.44.7.11
         IP = 10.43.7.11

      LVM = makes the following logical volumes on the admin partition visible
         Filesystem: /usr/nevis
         Filesystem: /mail
         Filesystem: /var/nevis
         Filesystem: /var/lib/nfs

      lsb::cups
      lsb::xinetd (includes tftp and ftp)
      lsb::dhcp (ln -sf /var/nevis/dhcpd /var/lib/dhcpd)
      lsb::nfs

      Xen virtual machines:
         sullivan (mailing list)
         tango (Samba)
         ada (= www; web server)
         franklin (= mail; mail server)
         hogwarts (= staff accounts for non-login users)

   Work:Master = the DRBD "work" partition's main image
      Filesystem: /work

assistant node:

   AssistantIPGroup (Constraint: -1000 to be on same system as hamilton)
      IP = 129.236.252.10 (burr = assistant)
      IP = 10.44.7.10

   mount library:/usr/nevis

   lsb::condor (Constraint: -INF for AdminDirectoriesGroup;
      if everything is running on one box, stop running condor)

On both systems: the STONITH resources.
</verbatim>
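To give a flavor of how an outline like this translates into an actual Pacemaker configuration, here is a sketch of two of the resources above written in =crm= shell syntax. The resource IDs, monitor intervals, netmask, and exact constraint wording are illustrative; the real definitions are on the [[Corosync configuration]] page.

<verbatim>
# Sketch only -- resource IDs, intervals, and netmask are illustrative.

# The DRBD "admin" device, run as a master/slave resource so that one node
# holds the primary copy while the other maintains the mirror:
primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource="admin" \
        op monitor interval="60s"
ms AdminMaster AdminDrbd \
        meta master-max="1" master-node-max="1" \
             clone-max="2" clone-node-max="1" notify="true"

# One of the "generic" cluster IP addresses (hamilton):
primitive MainIP ocf:heartbeat:IPaddr2 \
        params ip="129.236.252.11" cidr_netmask="24" \
        op monitor interval="30s"

# The IP address must sit on the node where the admin DRBD is primary,
# and may only start after that node has been promoted:
colocation MainIP-with-AdminMaster inf: MainIP AdminMaster:Master
order AdminMaster-before-MainIP inf: AdminMaster:promote MainIP:start

# The "+100 to be on hypatia" preference from the outline:
location AdminMaster-prefers-hypatia AdminMaster 100: hypatia
</verbatim>

In the real configuration, a similar pattern of colocation and ordering constraints ties each resource to the one above it in the outline, which is what produces the dependency behavior described earlier.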
---++ References

These are the web sites I used to develop the HA cluster configuration at Nevis.

---+++ Corosync/Pacemaker

http://www.clusterlabs.org/wiki/Main_Page <br />
http://www.clusterlabs.org/rpm/ <br />
http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple <br />
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ <br />
http://www.ourobengr.com/ha

---+++ DRBD

http://www.drbd.org/home/feature-list/ <br />
http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 <br />
http://howtoforge.com/highly-available-nfs-server-using-drbd-and-heartbeat-on-debian-5.0-lenny

---+++ Xen virtual machines

http://virt-manager.et.redhat.com/download.html <br />
http://wiki.xensource.com/xenwiki/XenNetworking <br />
http://toic.org/2008/10/06/multiple-network-interfaces-in-xen/