Difference: Ups (5 vs. 6)

Revision 6 - 2014-07-07 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 9 to 9
 

Background

Changed:
<
<
Power outages are a fact of life at Nevis. It's not unusual to have two or three multi-hour outages per year.
>
>
Power outages are a fact of life at Nevis. It's not unusual to have two or three multi-hour outages per year.
 
Changed:
<
<
To protect the systems from surges or equipment damage due to sudden power loss, all of the computer servers and workstations in the Room 119 computer enclosure at Nevis are protected by uninterruptible power supplies ("UPS"); other devices (e.g., the firewall in the Nevis network room and the server in the Nevis Annex) are connected to UPSes as well. Note that none of the processing nodes on the condor batch farm are connected to a UPS; they are not considered "critical" systems.
>
>
To protect the systems from surges or equipment damage due to sudden power loss, all of the computer servers and workstations in the Room 119 computer enclosure at Nevis are protected by uninterruptible power supplies ("UPS"); other devices (e.g., the firewall in the Nevis network room and the server in the Nevis Annex) are connected to UPSes as well. Note that none of the processing nodes on the condor batch farm are connected to a UPS; they are not considered "critical" systems.
 
Changed:
<
<
As you can see from the UPS status page, the UPSes can power the various systems for about 10 to 60 minutes. Since this is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries run low, and to turn them on again automatically when power is restored. The idea is that the Nevis systems will (hopefully) respond properly and automatically to a power outage, even when a system administrator is not immediately available.
>
>
As you can see from the UPS status page, the UPSes can power the various systems for about 10 to 60 minutes. Since this is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries run low, and to turn them on again automatically when power is restored. The idea is that the Nevis systems will (hopefully) respond properly and automatically to a power outage, even when a system administrator is not immediately available.
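As a rough illustration of the low-battery rule described above, here is a minimal Python sketch. It is not the script used at Nevis: the `UpsStatus` fields, the `should_shut_down` function, and the 5-minute threshold are all assumptions made for the example.

```python
from typing import NamedTuple


class UpsStatus(NamedTuple):
    """Snapshot of one UPS as a monitoring script might see it (hypothetical fields)."""
    on_battery: bool        # True while utility power is out
    runtime_minutes: float  # estimated battery runtime remaining


# Hypothetical threshold; the value actually used at Nevis is not stated here.
LOW_RUNTIME_MINUTES = 5.0


def should_shut_down(status: UpsStatus, threshold: float = LOW_RUNTIME_MINUTES) -> bool:
    """Return True when a system fed by this UPS should begin a clean shutdown."""
    # Only act during an outage; a low runtime estimate while on utility
    # power is a maintenance issue, not a reason to power off.
    return status.on_battery and status.runtime_minutes <= threshold
```

With this rule, a system rides out short outages on battery and shuts down only when the estimated runtime is nearly exhausted; the BIOS power-restore setting then brings it back once AC power returns.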
 

Configuration

Line: 36 to 29
 
  • All of the UPSes have SNMP management cards attached, which communicate their status via the ethernet.
    • More than one system can monitor the UPS status simultaneously. This is especially useful if the UPS supplies power to a network switch, as noted above.
Changed:
<
<
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.

      For this reason, all the major servers (including the mail server) have UPSes with SNMP management cards.

>
>
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.
 
  • The SNMP management cards are all on the private network, while many of the systems that monitor them are on the public network. All public<->private network traffic goes through the firewall. That means if the firewall goes down, the systems would lose connection to important UPSes. So if the firewall battery goes critical, all the Nevis cluster systems shut down (with the exceptions noted below).
Line: 44 to 37
 
  • The BIOS on all the systems has been set to automatically start the system back up when AC power is restored. If that were not set, the system would remain off even after Nevis power came back on and the UPS began supplying power to the system again.
Changed:
<
<
  • Once a week, hypatia.nevis.columbia.edu sends a command to each UPS to test its status. Once a month, hypatia sends a command to calibrate each UPS's battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.

>
>
  • Once a week, ada.nevis.columbia.edu sends a command to each UPS to test its status. Once a year, ada sends a command to calibrate each UPS's battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.
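
The firewall dependency in the list above (if the firewall's UPS battery goes critical, the cluster systems shut down, with some exceptions) could be expressed roughly as follows. This is only a sketch: the function name, the host names in the usage note, and the `exempt` set are placeholders, and the real exceptions are the ones noted elsewhere on this page.

```python
from typing import Iterable, List


def systems_to_shut_down(
    firewall_battery_critical: bool,
    cluster_systems: Iterable[str],
    exempt: Iterable[str] = (),
) -> List[str]:
    """Return the systems that should shut down given the firewall UPS state.

    Systems on the public network reach the UPS management cards through the
    firewall, so once the firewall is about to lose power they can no longer
    monitor their own UPSes and are shut down pre-emptively.
    """
    if not firewall_battery_critical:
        return []
    exempt_set = set(exempt)
    # Preserve the caller's ordering; skip any system listed as an exception.
    return [name for name in cluster_systems if name not in exempt_set]
```

For example, `systems_to_shut_down(True, ["server-a", "server-b"], exempt=["server-b"])` returns `["server-a"]`, where both host names are made up for illustration.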

 

"Wake-up" box

 