Computer room temperature and cluster shutdown
The main computer area for the Nevis particle-physics groups is room 119; the ATLAS group has a separate Tier3 cluster in room 86.
As of July-2011, room 119 has two air conditioners: a main 5-ton unit that's over 18 years old, and a backup unit that's over 40 years old. Although both units are generally kept in good repair, it's become more common in recent years for one of them to fail; if both fail at the same time, the heat from the computer systems will cause the temperature in that room to rise to the point where it might damage the equipment. In order to preserve the systems and the files on their hard drives, there's an automated procedure for shutting down the computers in response to high temperatures in the computer room.
The idea is to keep the cluster in a useful state for as long as possible, while shutting down the less-necessary systems to keep the room cool. This is implemented as staged levels of escalation. Every ten minutes a script runs to check the computer room's temperature:
- If the temperature goes over the threshold (currently 90 degrees), the batch nodes and the backup server are shut down.
- If the temperature remains over the threshold the next time the script runs, the file servers are shut down.
- If the temperature is still over the threshold the next time the script runs, the login servers and the workstations are shut down. At this point only the administrative servers are running.
- The next level of escalation is to shut down everything but hypatia.
- The last level is to shut down hypatia; this is the machine that runs this script, email, NIS, etc.
If the A/C has not cooled the room to below 90 degrees with only hypatia running after 50 minutes, something is seriously wrong in that room.
If at any point the temperature falls below a "recovery" temperature, currently 85 degrees, the escalation falls back to "nothing." Note that there is no method of automatically starting these systems again. If they are shut down, they must be powered on manually.
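The escalation and recovery rules above can be sketched as a small state machine. This is only an illustration in Python, not the actual script; the names `next_level`, `simulate`, and `LEVELS` are made up for this example. It also assumes that when the temperature sits between the recovery temperature and the threshold, the escalation level simply stays where it is, which the description above does not spell out.

```python
# Illustrative sketch only; the real logic lives in the script on hypatia.
THRESHOLD_DEGREES = 90.0  # shutdown threshold quoted in the text
RECOVERY_DEGREES = 85.0   # recovery temperature quoted in the text

# What is shut down on reaching each level; level 0 means "nothing".
LEVELS = [
    "nothing",
    "batch nodes and backup server",
    "file servers",
    "login servers and workstations",
    "everything but hypatia",
    "hypatia",
]


def next_level(level, temperature):
    """Return the escalation level after one ten-minute check."""
    if temperature < RECOVERY_DEGREES:
        # Below the recovery temperature the escalation resets.
        return 0
    if temperature > THRESHOLD_DEGREES:
        # Still too hot: escalate one step, stopping at the last level.
        return min(level + 1, len(LEVELS) - 1)
    # Assumption: between the two temperatures, hold the current level.
    return level


def simulate(readings):
    """Apply a sequence of temperature readings, one per script run."""
    level = 0
    history = []
    for temperature in readings:
        level = next_level(level, temperature)
        history.append(LEVELS[level])
    return history
```

For example, `simulate([91, 92, 84])` steps through the first two levels and then falls back to "nothing". The reset only affects the escalation level; as noted above, systems that were already shut down stay off until someone powers them on manually.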
If you're logged into a Nevis system and you see a warning about a system shutdown due to temperature, log off immediately (though you don't have much choice). If you can still log in to any other system, save your work, cancel all condor jobs, and log off again.
If you want to look at the script, it's in
/usr/nevis/adm/ambient-temperature-check.pl