Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Computer room temperature and cluster shutdown | ||||||||
Line: 6 to 6 | ||||||||
As of July-2011, room 119 has two air conditioners: a main 5-ton unit that's over 18 years old, and a backup unit that's over 40 years old. Although both units are generally kept in good repair, it's become more common in recent years for one of them to fail; if both fail at the same time, the heat from the computer systems will cause the temperature in that room to rise to the point where it might damage the equipment. In order to preserve the systems and the files on their hard drives, there's an automated procedure for shutting down the computers in response to high temperatures in the computer room. | ||||||||
Changed: | ||||||||
< < | The idea is to try to keep the cluster in a useful state for as long as possible, while shutting down the less-necessary systems to keep the room cooler if possible. This is implemented as staged levels of escalation. Every ten minutes a script runs to check the computer room's temperature | |||||||
> > | The idea is to try to keep the cluster in a useful state for as long as possible, while shutting down the less-necessary systems to keep the room cool. This is implemented as staged levels of escalation. Every ten minutes a script runs to check the computer room's temperature | |||||||
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Computer room temperature and cluster shutdown | ||||||||
Line: 14 to 14 | ||||||||
| ||||||||
Changed: | ||||||||
< < | If at any point the temperature falls below a "recovery" temperature, currently 85 degrees, the escalation falls back to "nothing"; if the A/C has not cooled below 90 degrees with only hypatia running after 50 minutes, something is seriously wrong in that room. | |||||||
> > | If the A/C has not cooled the room to below 90 degrees with only hypatia running after 50 minutes, something is seriously wrong in that room. | |||||||
Changed: | ||||||||
< < | Note that there is no method of automatically starting these systems again. If they are shut down, they must be powered on manually. | |||||||
> > | If at any point the temperature falls below a "recovery" temperature, currently 85 degrees, the escalation falls back to "nothing." Note that there is no method of automatically starting these systems again. If they are shut down, they must be powered on manually. | |||||||
If you're logged into a Nevis system and you see a warning about a system shutdown due to temperature, log off immediately (though you don't have much choice). If you can still log in to any other system, save your work, cancel all condor jobs, and log off again. |
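The escalation and recovery rule described above can be sketched as a small state machine. This is a minimal illustration, not the contents of the actual check script: only the 85-degree recovery temperature and the ten-minute check interval come from the text. The 90-degree trigger threshold, the number of stages, the stage names, and the function names are assumptions made for the example.

```python
# Sketch of staged escalation with a recovery temperature.
# From the text: recovery temperature of 85 degrees, one check every ten minutes.
# Assumed for illustration: the trigger threshold, the stage count, and all names.
RECOVERY_TEMP = 85.0
TRIGGER_TEMP = 90.0  # assumption: escalate while at or above this reading

# Hypothetical stage names; "nothing" is the resting state named in the text.
LEVELS = ["nothing", "stage-1", "stage-2", "stage-3", "stage-4", "hypatia-only"]


def next_level(level: int, temperature: float) -> int:
    """Return the escalation level after one temperature check."""
    if temperature < RECOVERY_TEMP:
        # Below the recovery temperature the escalation falls back to "nothing".
        return 0
    if temperature >= TRIGGER_TEMP:
        # Still too hot: move up one stage, stopping at the last one.
        return min(level + 1, len(LEVELS) - 1)
    # Between the two thresholds: hold the current stage.
    return level


def simulate(readings: list[float]) -> str:
    """Apply one check per reading (one reading per ten minutes)."""
    level = 0
    for temperature in readings:
        level = next_level(level, temperature)
    return LEVELS[level]
```

Under these assumptions, five consecutive hot readings (50 minutes) reach the last stage, and a single reading below 85 degrees resets the escalation. Resetting the level does not power anything back on; as noted above, systems that were shut down must be started manually.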
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Computer room temperature and cluster shutdown | ||||||||
The main computer area for the Nevis particle-physics groups is room 119; the ATLAS group has a separate Tier3 | ||||||||
Changed: | ||||||||
< < | As of July-2011, room 119 has two air conditioners: a main 5-ton unit that's over 25 years old, and a backup unit that's over 40 years old. Although both units are generally kept in good repair, it's become more common in recent years for one of them to fail; if both fail at the same time, the heat from the computer systems will cause the temperature in that room to rise to the point where it might damage the equipment. In order to preserve the systems and the files on their hard drives, there's an automated procedure for shutting down the computers in response to high temperatures in the computer room. | |||||||
> > | As of July-2011, room 119 has two air conditioners: a main 5-ton unit that's over 18 years old, and a backup unit that's over 40 years old. Although both units are generally kept in good repair, it's become more common in recent years for one of them to fail; if both fail at the same time, the heat from the computer systems will cause the temperature in that room to rise to the point where it might damage the equipment. In order to preserve the systems and the files on their hard drives, there's an automated procedure for shutting down the computers in response to high temperatures in the computer room. | |||||||
The idea is to try to keep the cluster in a useful state for as long as possible, while shutting down the less-necessary systems to keep the room cooler if possible. This is implemented as staged levels of escalation. Every ten minutes a script runs to check the computer room's temperature |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
Computer room temperature and cluster shutdown
The main computer area for the Nevis particle-physics groups is room 119; the ATLAS group has a separate Tier3
/usr/nevis/adm/ambient-temperature-check.pl |