Difference: DiskSharing (2 vs. 3)

Revision 3, 2011-07-08 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="LinuxCluster"

Disk sharing and the condor batch system

Line: 16 to 16
  There's a bigger problem when that server is one on which users' /home directories are located. The mail server checks your home directory every time you receive a mail message. If your home directory is not available, it slows down the mail server. If the mail server receives a lot of mail for users with home directories on that server, the mail server can slow to the point of uselessness, or crash. If the mail server crashes, the entire Linux cluster can crash.
Changed:
<
<
The choice is between potentially crashing the cluster and making you think about disk files. It seems pretty obvious that thinking is your only option.
>
>
The choice is between potentially crashing the cluster and making you think about disk files. It's obvious: thinking is your only option.
 

Types of systems

Line: 26 to 26
 
  • Login servers. These are the systems with /home directories. You can log in to these systems from outside Nevis.
  • File servers. These are systems with large amounts of disk space. Each collaboration decides whether you can log in to these systems from outside Nevis, or only from systems inside Nevis; it's possible that you can't log in to them at all, and can only modify their disks via automount.
  • Workstations, including the student boxes. If you come to Nevis and you don't have a laptop, you'll use one of them. When no one is using a workstation, it's available for condor.
Changed:
<
<
  • Batch nodes. They provide processing queues for condor. You can't login to them, and they don't have much disk space. These are the bulk of the systems that run your jobs.
>
>
  • Batch nodes. They provide processing queues for condor. You can't login to them, and they don't have much disk space. These are the bulk of the systems that run your jobs. (The ATLAS T3 cluster has a different scheme.)
 

Directories and how they're shared

Note: This describes the "ideal" case, which as of Jul-2011 only applies to the Neutrino group. As other groups continue to add and maintain systems on the cluster, I'm going to encourage moving to the separate login/file server configuration. Until then, be careful; crashing the cluster looks like more fun than it is.

Changed:
<
<
There are many exceptions, but here are the general rules:
>
>
There are many exceptions (e.g., ATLAS T3), but here are the general rules:
 
  • Login servers have a /home partition of 100-200 GB, and a /scratch partition that uses the rest of the available disk space. These partitions are not exported to the batch nodes.
  • File servers have a /share partition of 100-200 GB, and a /data partition that uses the rest of the available disk space; usually the latter is several TB. These partitions are exported to all the systems in the cluster; in particular, the batch nodes. The /share partition is exported read-only to the batch nodes, specifically to keep an automated process from filling it up.
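Since the main risk described above is filling up a shared partition, it's worth checking free space before writing large files. The sketch below is just a convenience wrapper around df; the /a/data/<server> form is the automount style used elsewhere in this topic, and <server> is a placeholder for your group's file server.

```shell
# Sketch: check remaining space on a partition before dumping large files
# onto it.  Works on any existing path; on the cluster you would pass an
# automounted path such as /a/data/<server> (placeholder, not a real path).
check_space() {
    # Print only the summary line of df: device, size, used, available, use%, mount.
    df -h "$1" | tail -n 1
}

check_space /    # example invocation; substitute the partition you care about
```

If the "Avail" column is small relative to what your jobs will write, pick a different /data area (or clean up) before submitting.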
Line: 100 to 100
 log = /a/data/amsterdam/jsmith/myjobs/mySimulation-$(Process).log
Added:
>
>
There is another solution: delete these "scrap" files once you no longer need them. Only the most intelligent and clever physicists remember to do this... and we have high hopes for you!
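Deleting scrap files is easy to automate. This is a sketch, not an official cluster script: the directory, file patterns, and 30-day age are all example assumptions, so check the path before running it against real data.

```shell
#!/bin/sh
# Sketch of a scrap-file cleanup for condor output.  Assumptions (adjust to
# taste): scrap files end in .log/.out/.err, and anything untouched for more
# than 30 days is safe to delete.
clean_scrap() {
    dir="$1"
    # -mtime +30 matches files whose contents haven't changed in >30 days;
    # -print lists each file as it is removed by -delete.
    find "$dir" -type f \
        \( -name '*.log' -o -name '*.out' -o -name '*.err' \) \
        -mtime +30 -print -delete
}

# Example (the path below is the one used in the submit-file example above;
# double-check it is yours before running):
# clean_scrap /a/data/amsterdam/jsmith/myjobs
```

Running something like this by hand after each analysis round (or from your own cron job) keeps the shared /data partitions from silently filling up.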
 
META FILEATTACHMENT attachment="Slide1.gif" attr="h" comment="How a file server shares files" date="1309982722" name="Slide1.gif" path="Slide1.gif" size="7919" user="WilliamSeligman" version="1"
 