Difference: Ups (1 vs. 7)

Revision 7 - 2018-10-22 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 49 to 49
 
  • The problem is if there's an intermediate-length power outage (20-30 minutes). The servers go down in response to the low-battery signal from the UPSes, but the UPSes don't have time to fully drain their batteries. This means the servers don't come back up, since they went down via an internal "shutdown -h" command and their AC power is never interrupted.
Changed:
<
<
To solve this problem, one node has been designated the "wake-up" box. As of Apr-2012 it's hermes01.nevis.columbia.edu, but it could be any node not connected to a UPS. It goes down when power is cut off, comes back up when power is restored. As soon as it comes back up, it sends a wake-up signal to the servers. If a server has IPMI, it uses that; otherwise it sends a wake-on-lan signal.
>
>
To solve this problem, one node has been designated the "wake-up" box. As of Apr-2016 it's kennel00.nevis.columbia.edu, but it could be any node not connected to a UPS. It goes down when power is cut off, comes back up when power is restored. As soon as it comes back up, it sends a wake-up signal to the servers. If a server has IPMI, it uses that; otherwise it sends a wake-on-lan signal.
  The BIOS of the "wake-up" box is set to bring up the system quickly, unlike the BIOSes of the other systems, which are set to delay for as long as possible to give the main servers a chance to come back up. This means the "wake-up" box might come up before NIS and NFS are available on the cluster. This seems a reasonable price to pay for having the rest of the cluster come back up in a working state.
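
The wake-on-lan half of the wake-up signal can be sketched in a few lines of Python. This is a minimal illustration, not the script that actually runs on the "wake-up" box: the function names, the broadcast address, and the UDP port are assumptions, and the IPMI path is not shown. The only fixed part is the standard "magic packet" layout: six 0xFF bytes followed by the target's MAC address repeated sixteen times.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the wake-on-lan magic packet for a MAC such as 'aa:bb:cc:dd:ee:ff'."""
    digits = mac.replace(":", "").replace("-", "")
    if len(digits) != 12:
        raise ValueError(f"expected 12 hex digits in MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the MAC address repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_wake_on_lan(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumed defaults)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

A call such as send_wake_on_lan("00:11:22:33:44:55") (a made-up MAC address) would be issued once per server that lacks IPMI. The target's network card and BIOS must have wake-on-lan enabled for the packet to have any effect.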

Revision 6 - 2014-07-07 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 9 to 9
 

Background

Changed:
<
<
Power outages are a fact of life at Nevis. It's not unusual to have two or three multi-hour outages per year.
>
>
Power outages are a fact of life at Nevis. It's not unusual to have two or three multi-hour outages per year.
 
Changed:
<
<
To protect the systems from surges or equipment damage due to sudden power loss, all of the computer servers and workstations in the Room 119 computer enclosure at Nevis are protected by uninterruptible power supplies ("UPS"); other devices (e.g., the firewall in the Nevis network room; the server in the Nevis Annex) are connected to UPSes as well. Note that none of the processing nodes on the condor batch farm are connected to a UPS; they are not considered "critical" systems.
>
>
To protect the systems from surges or equipment damage due to sudden power loss, all of the computer servers and workstations in the Room 119 computer enclosure at Nevis are protected by uninterruptible power supplies ("UPS"); other devices (e.g., the firewall in the Nevis network room; the server in the Nevis Annex) are connected to UPSes as well. Note that none of the processing nodes on the condor batch farm are connected to a UPS; they are not considered "critical" systems.
 
Changed:
<
<
As you can see from the UPS status page, the UPSes can supply power to the various systems for times ranging from about 10 to 60 minutes. Since this time is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries get low on power, and to automatically turn on the systems again when power is restored. The idea is that (hopefully) the Nevis systems will respond properly and automatically in the event of a power outage, even during times when a system administrator is not immediately available.
>
>
As you can see from the UPS status page, the UPSes can supply power to the various systems for times ranging from about 10 to 60 minutes. Since this time is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries get low on power, and to automatically turn on the systems again when power is restored. The idea is that (hopefully) the Nevis systems will respond properly and automatically in the event of a power outage, even during times when a system administrator is not immediately available.
 

Configuration

Line: 36 to 29
 
  • All of the UPSes have SNMP management cards attached, which communicate their status via the ethernet.
    • More than one system can monitor the UPS status simultaneously. This is especially useful if the UPS supplies power to a network switch, as noted above.
Changed:
<
<
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.

      For this reason, all the major servers (including the mail server) have UPSes with SNMP management cards.

>
>
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.
 
  • The SNMP management cards are all on the private network, while many of the systems that monitor them are on the public network. All public<->private network traffic goes through the firewall. That means if the firewall goes down, the systems would lose connection to important UPSes. So if the firewall battery goes critical, all the Nevis cluster systems shut down (with the exceptions noted below).
Line: 44 to 37
 
  • The BIOS on all the systems has been set to automatically start the system back up on AC power restore. If that was not set, then the system would remain off even after Nevis power came back on and the UPS began supplying power to the system again.
Changed:
<
<
  • Once a week, hypatia.nevis.columbia.edu sends a command to each UPS to test its status. Once a month, hypatia sends a command to calibrate each UPS' battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.

>
>
  • Once a week, ada.nevis.columbia.edu sends a command to each UPS to test its status. Once a year, ada sends a command to calibrate each UPS' battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.

 

"Wake-up" box

Revision 5 - 2012-04-30 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 19 to 19
 are connected to a UPS; they are not considered "critical" systems.
Changed:
<
<
As you can see from the UPS status page, the UPSes can supply power to the various systems for times ranging from about 10 to 60 minutes. Since this time is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries get low on power, and to automatically turn on the systems again when power is restored. The idea is that (hopefully) the Nevis systems will respond properly and automatically in the event of a power outage, even during times when a system administrator is not immediately available.
>
>
As you can see from the UPS status page, the UPSes can supply power to the various systems for times ranging from about 10 to 60 minutes. Since this time is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries get low on power, and to automatically turn on the systems again when power is restored. The idea is that (hopefully) the Nevis systems will respond properly and automatically in the event of a power outage, even during times when a system administrator is not immediately available.
 

Configuration

Changed:
<
<
The software programs used to monitor the UPSes and control the attached systems are the Network UPS Tools or "NUT". The details of the NUT configuration, in /etc/ups, are not accessible to most users. Here's the general policy applied to configuring NUT on the various systems:
>
>
The software programs used to monitor the UPSes and control the attached systems are the Network UPS Tools or "NUT". The details of the NUT configuration, in /etc/ups, are not accessible to most users. Here's the general policy applied to configuring NUT on the various systems:
 
  • A UPS goes "critical" if both of the following are true:
    • There is no AC power being supplied to the UPS.
Line: 44 to 34
 
  • The network switches, including the firewall in the network room, are also attached to UPSes. If a system's network is connected to a switch whose UPS goes critical, NUT will shut down the system.

    The idea is if a system loses its network connectivity, odds are that its NIS and automount services will get into a bizarre state that would delay or prevent the completion of a shutdown. It's best to issue the shutdown command before that occurs.

Changed:
<
<
  • Allt of the UPSes have SNMP management cards attached, which communicate their status via the ethernet.
>
>
  • All of the UPSes have SNMP management cards attached, which communicate their status via the ethernet.
 
    • More than one system can monitor the UPS status simultaneously. This is especially useful if the UPS supplies power to a network switch, as noted above.
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.

      For this reason, all the major servers (including the mail server) have UPSes with SNMP management cards.

Line: 55 to 45
 
  • The BIOS on all the systems has been set to automatically start the system back up on AC power restore. If that was not set, then the system would remain off even after Nevis power came back on and the UPS began supplying power to the system again.

  • Once a week, hypatia.nevis.columbia.edu sends a command to each UPS to test its status. Once a month, hypatia sends a command to calibrate each UPS' battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.

\ No newline at end of file
Added:
>
>

"Wake-up" box

Even with all the preparation described above, there's still a potential problem depending on the length of a power outage.

  • If there's a short power outage (e.g., a few seconds), the servers stay up (because of the UPSes); the nodes go down and come up when power comes back.

  • If there's a long power outage (4 hours+), the servers also go down in response to signals from the UPSes. The UPSes eventually run out of power. When power comes back, the UPSes restore power so the servers come back up.

  • The problem is if there's an intermediate-length power outage (20-30 minutes). The servers go down in response to the low-battery signal from the UPSes, but the UPSes don't have time to fully drain their batteries. This means the servers don't come back up, since they went down via an internal "shutdown -h" command and their AC power is never interrupted.

To solve this problem, one node has been designated the "wake-up" box. As of Apr-2012 it's hermes01.nevis.columbia.edu, but it could be any node not connected to a UPS. It goes down when power is cut off, comes back up when power is restored. As soon as it comes back up, it sends a wake-up signal to the servers. If a server has IPMI, it uses that; otherwise it sends a wake-on-lan signal.

The BIOS of the "wake-up" box is set to bring up the system quickly, unlike the BIOSes of the other systems, which are set to delay for as long as possible to give the main servers a chance to come back up. This means the "wake-up" box might come up before NIS and NFS are available on the cluster. This seems a reasonable price to pay for having the rest of the cluster come back up in a working state.
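
The three outage cases described above can be summarized as a small decision function. This is only an illustration of the reasoning: the function name, its parameters, and the sample thresholds (servers halting about 15 minutes into an outage, UPS batteries fully drained after about 4 hours) are assumptions, not values taken from the Nevis configuration.

```python
def server_outcome(outage_min: float,
                   critical_after_min: float,
                   drained_after_min: float) -> str:
    """Classify what happens to a UPS-protected server for a given outage length.

    critical_after_min: minutes into the outage at which the UPS goes critical
                        and the server halts itself (assumed value).
    drained_after_min:  minutes into the outage at which the UPS battery is
                        empty and the server's AC power is really cut (assumed).
    """
    if outage_min < critical_after_min:
        # Short outage: the UPS carries the server through.
        return "stays up"
    if outage_min >= drained_after_min:
        # Long outage: AC power was interrupted, so the BIOS
        # "start on AC power restore" setting brings the server back.
        return "restarts on AC restore"
    # Intermediate outage: the server halted, but its AC power never dropped.
    return "needs wake-up signal"
```

With the assumed thresholds, server_outcome(25, 15, 240) falls into the intermediate case, which is the one the "wake-up" box exists to handle.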

 \ No newline at end of file

Revision 4 - 2011-07-27 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 44 to 44
 
  • The network switches, including the firewall in the network room, are also attached to UPSes. If a system's network is connected to a switch whose UPS goes critical, NUT will shut down the system.

    The idea is if a system loses its network connectivity, odds are that its NIS and automount services will get into a bizarre state that would delay or prevent the completion of a shutdown. It's best to issue the shutdown command before that occurs.

Changed:
<
<
  • Some UPSes supply power to more than one system; as of May-07, for example, both lincoln and sullivan are plugged into the same UPS. In such a situation, one system is the "UPS master" and the other is the "UPS slave"; NUT on the master usually communicates directly with the UPS, while the slave gets the UPS status by communicating with the master. If the UPS goes critical, the slave will shut down immediately; the master will wait a minute or so to give the slave a chance to receive the critical signal.

  • Some UPSes communicate their status via serial cables, which can only be connected to a single system; that's the reason for the "master-slave" situation described in the previous point.

  • The rest of the UPSes have SNMP management cards attached, which communicate their status via the ethernet. This has two advantages:
>
>
  • Allt of the UPSes have SNMP management cards attached, which communicate their status via the ethernet.
 
    • More than one system can monitor the UPS status simultaneously. This is especially useful if the UPS supplies power to a network switch, as noted above.
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.

      For this reason, all the major servers (including the mail server) have UPSes with SNMP management cards.

Line: 58 to 54
 
  • The BIOS on all the systems has been set to automatically start the system back up on AC power restore. If that was not set, then the system would remain off even after Nevis power came back on and the UPS began supplying power to the system again.
Deleted:
<
<
  • Some systems have an older BIOS that cannot be set to automatically start on AC power restore; polaris.nevis.columbia.edu is an example. The BIOS on those systems was fixed at the factory to go to the "last state": if the system was powered down normally, then when AC power is restored the system will remain down. On such systems, NUT has been configured to not issue a system shutdown when the attached UPS goes critical. So these "old-BIOS" systems run until their battery runs out of power, then crash; they come up immediately when the UPS starts providing power again.

    This is a risk; the point of a UPS and NUT is to help machines shut down and start up cleanly. However, it turns out the delays caused by waiting for a systems administrator to give such systems personal attention outweigh the risk.

  • Some UPSes (e.g., the one that supplies power to the mail server) do not turn on their power immediately after Nevis power is restored; they are set to delay a few minutes. The reason is that those systems will come up more smoothly if other Nevis systems are already on; this is the case for the mail server, which mounts a lot of directories from other systems.
 
  • Once a week, hypatia.nevis.columbia.edu sends a command to each UPS to test its status. Once a month, hypatia sends a command to calibrate each UPS' battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.

\ No newline at end of file

Revision 3 - 2010-05-27 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 52 to 52
 
    • More than one system can monitor the UPS status simultaneously. This is especially useful if the UPS supplies power to a network switch, as noted above.
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.

      For this reason, all the major servers (including the mail server) have UPSes with SNMP management cards.

Changed:
<
<
  • The SNMP management cards are all on the private network, while many of the systems that monitor them are on the public network. All public<->private network traffic goes through the firewall. That means if the firewall goes down, the systems would lose connection to important UPSes. So if the firewall battery goes critical, all the Nevis cluster systems shut down (with the exceptions noted below).
>
>
  • The SNMP management cards are all on the private network, while many of the systems that monitor them are on the public network. All public<->private network traffic goes through the firewall. That means if the firewall goes down, the systems would lose connection to important UPSes. So if the firewall battery goes critical, all the Nevis cluster systems shut down (with the exceptions noted below).
 
  • As of May-2007, the UPS attached to the firewall only supplies power for about 13 minutes. Given the previous point, this means that ten minutes into a power outage, the systems will start shutting themselves down; the three-minute buffer is to give time for the systems to shut down cleanly.

Revision 2 - 2010-05-21 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="Computing"

Nevis UPS Management

Line: 42 to 42
 
  • If a system is directly attached to a UPS, the system uses NUT to monitor when the UPS goes critical. If the attached UPS goes critical, NUT sends a shutdown command to the system.
Changed:
<
<
  • The network switches, including the firewall in the network room, are also attached to UPSes. If a system's network is connected to a switch whose UPS goes critical, NUT will shut down the system.

    The idea is if a system loses its network connectivity, odds are that its NIS and automount services will get into a bizarre state that would delay or prevent the completion of a shutdown. It's best to issue the shutdown command before that occurs.

>
>
  • The network switches, including the firewall in the network room, are also attached to UPSes. If a system's network is connected to a switch whose UPS goes critical, NUT will shut down the system.

    The idea is if a system loses its network connectivity, odds are that its NIS and automount services will get into a bizarre state that would delay or prevent the completion of a shutdown. It's best to issue the shutdown command before that occurs.

 
  • Some UPSes supply power to more than one system; as of May-07, for example, both lincoln and sullivan are plugged into the same UPS. In such a situation, one system is the "UPS master" and the other is the "UPS slave"; NUT on the master usually communicates directly with the UPS, while the slave gets the UPS status by communicating with the master. If the UPS goes critical, the slave will shut down immediately; the master will wait a minute or so to give the slave a chance to receive the critical signal.

Revision 1 - 2010-04-07 - WilliamSeligman

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="Computing"

Nevis UPS Management

This page describes how Uninterruptible Power Supplies ("UPS") are monitored at Nevis. There's also a web page on which you can see the current UPS status.

Background

Power outages are a fact of life at Nevis. It's not unusual to have two or three multi-hour outages per year.

To protect the systems from surges or equipment damage due to sudden power loss, all of the computer servers and workstations in the Room 119 computer enclosure at Nevis are protected by uninterruptible power supplies ("UPS"); other devices (e.g., the firewall in the Nevis network room; the server in the Nevis Annex) are connected to UPSes as well. Note that none of the processing nodes on the condor batch farm are connected to a UPS; they are not considered "critical" systems.

As you can see from the UPS status page, the UPSes can supply power to the various systems for times ranging from about 10 to 60 minutes. Since this time is shorter than a typical multi-hour power outage at Nevis, there is a system in place to shut down the systems when the UPS batteries get low on power, and to automatically turn on the systems again when power is restored. The idea is that (hopefully) the Nevis systems will respond properly and automatically in the event of a power outage, even during times when a system administrator is not immediately available.

Configuration

The software programs used to monitor the UPSes and control the attached systems are the Network UPS Tools or "NUT". The details of the NUT configuration, in /etc/ups, are not accessible to most users. Here's the general policy applied to configuring NUT on the various systems:

  • A UPS goes "critical" if both of the following are true:
    • There is no AC power being supplied to the UPS.
    • The UPS battery goes "low"; that is, the UPS determines that it has 3-5 minutes of power left under its current load.

  • If a system is directly attached to a UPS, the system uses NUT to monitor when the UPS goes critical. If the attached UPS goes critical, NUT sends a shutdown command to the system.

  • The network switches, including the firewall in the network room, are also attached to UPSes. If a system's network is connected to a switch whose UPS goes critical, NUT will shut down the system.

    The idea is if a system loses its network connectivity, odds are that its NIS and automount services will get into a bizarre state that would delay or prevent the completion of a shutdown. It's best to issue the shutdown command before that occurs.

  • Some UPSes supply power to more than one system; as of May-07, for example, both lincoln and sullivan are plugged into the same UPS. In such a situation, one system is the "UPS master" and the other is the "UPS slave"; NUT on the master usually communicates directly with the UPS, while the slave gets the UPS status by communicating with the master. If the UPS goes critical, the slave will shut down immediately; the master will wait a minute or so to give the slave a chance to receive the critical signal.

  • Some UPSes communicate their status via serial cables, which can only be connected to a single system; that's the reason for the "master-slave" situation described in the previous point.

  • The rest of the UPSes have SNMP management cards attached, which communicate their status via the ethernet. This has two advantages:
    • More than one system can monitor the UPS status simultaneously. This is especially useful if the UPS supplies power to a network switch, as noted above.
    • It allows a systems administrator to reboot a system remotely, by shutting off the UPS's power via the management card. This has already proved useful when a system has gotten into a state in which it was not possible to log in; the system can be rebooted without waiting for the sysadmin to travel to Nevis.

      For this reason, all the major servers (including the mail server) have UPSes with SNMP management cards.

  • The SNMP management cards are all on the private network, while many of the systems that monitor them are on the public network. All public<->private network traffic goes through the firewall. That means if the firewall goes down, the systems would lose connection to important UPSes. So if the firewall battery goes critical, all the Nevis cluster systems shut down (with the exceptions noted below).

  • As of May-2007, the UPS attached to the firewall only supplies power for about 13 minutes. Given the previous point, this means that ten minutes into a power outage, the systems will start shutting themselves down; the three-minute buffer is to give time for the systems to shut down cleanly.

  • The BIOS on all the systems has been set to automatically start the system back up on AC power restore. If that was not set, then the system would remain off even after Nevis power came back on and the UPS began supplying power to the system again.

  • Some systems have an older BIOS that cannot be set to automatically start on AC power restore; polaris.nevis.columbia.edu is an example. The BIOS on those systems was fixed at the factory to go to the "last state": if the system was powered down normally, then when AC power is restored the system will remain down. On such systems, NUT has been configured to not issue a system shutdown when the attached UPS goes critical. So these "old-BIOS" systems run until their battery runs out of power, then crash; they come up immediately when the UPS starts providing power again.

    This is a risk; the point of a UPS and NUT is to help machines shut down and start up cleanly. However, it turns out the delays caused by waiting for a systems administrator to give such systems personal attention outweigh the risk.

  • Some UPSes (e.g., the one that supplies power to the mail server) do not turn on their power immediately after Nevis power is restored; they are set to delay a few minutes. The reason is that those systems will come up more smoothly if other Nevis systems are already on; this is the case for the mail server, which mounts a lot of directories from other systems.

  • Once a week, hypatia.nevis.columbia.edu sends a command to each UPS to test its status. Once a month, hypatia sends a command to calibrate each UPS' battery under its current load. These tests are run in the early-morning hours, between 2AM and 5AM.

    Aside from keeping the UPS status page accurate, these tests help assure us that the UPS batteries are functioning properly. Typically, a UPS battery has to be replaced about once every five years; these tests let us know when it's time for a replacement.
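
The "critical" rule and the firewall timing in the list above can be expressed as a short Python sketch. It is an illustration only, not part of the Nevis NUT configuration: the function names are made up, and the code assumes the usual NUT convention that a UPS reports its state as a space-separated list of flags in which "OB" means on battery (no AC power) and "LB" means low battery.

```python
def is_critical(ups_status: str) -> bool:
    """True if the UPS is both on battery ("OB") and low on battery ("LB").

    ups_status is a space-separated flag string such as "OL" or "OB LB"
    (assumed to follow NUT's ups.status format).
    """
    flags = set(ups_status.split())
    return "OB" in flags and "LB" in flags


def shutdown_start_min(runtime_min: float, low_battery_margin_min: float) -> float:
    """Minutes into an outage at which the UPS goes critical.

    runtime_min:            how long the UPS can carry its load.
    low_battery_margin_min: remaining runtime at which the battery counts as "low".
    """
    return runtime_min - low_battery_margin_min
```

For the firewall UPS described above, shutdown_start_min(13, 3) gives 10: with about 13 minutes of runtime and a three-minute low-battery margin, the systems begin shutting down ten minutes into an outage.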

 