Details of running condor at Nevis

Suspended jobs

As noted elsewhere on this page, we generally use the vanilla universe at Nevis. Vanilla-universe jobs are not checkpointed, so a job that is suspended on a given machine can only resume on that same machine. If that machine is busy running other jobs, the suspended job must wait.
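For reference, a job requests the vanilla universe in its submit description file. The following is only a minimal sketch, not a Nevis-recommended template: the file names (my_job.sub, my_job.sh, and the output, error, and log files) are hypothetical placeholders.

```shell
# Write a minimal submit description file that requests the vanilla universe.
# All file names below are illustrative assumptions, not Nevis conventions.
write_submit_file() {
    submit_file="${1:-my_job.sub}"
    cat > "$submit_file" <<'EOF'
universe   = vanilla
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log
queue
EOF
}

write_submit_file my_job.sub
# The file would then be submitted with: condor_submit my_job.sub
```

Writing the file from a shell function keeps the example self-contained; in practice the submit file is usually edited by hand.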


Extra disk space

In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5 TB.

The names of these RAID arrays are:

  • /a/data/condor/array1/
  • /a/data/condor/array2/

For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group):

cd /a/data/condor/array2/atlas/
mkdir $USER
cd $USER # ... create whatever files you want
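Because these arrays are shared, it may be worth checking how much space is left before writing large files. The helper below is a sketch under that assumption; the function name free_kb is made up for illustration, and it relies only on the portable output format of df.

```shell
# Print the free space (in 1 KB blocks) on the filesystem holding a directory.
free_kb() {
    # -P forces the portable one-line-per-filesystem format;
    # the fourth column of the second line is the available space.
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical usage on the cluster:
#   free_kb /a/data/condor/array2/
free_kb "${TMPDIR:-/tmp}"
```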

Important! If you're skimming this page, stop and read the following paragraph!

The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5 arrays, which are a reliable form of storage, and there is monitoring software that warns if any individual drive has failed. However, RAID arrays have been known to fail (and we've had at least one such failure at Nevis). If you have any critical data stored on these drives, make sure you back up the files yourself.

One more time: the files on these partitions are not backed up!
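One simple way to keep your own copy is to archive a directory to a different disk. The function below is a sketch, not a Nevis-provided tool: the name backup_dir and the destination path in the usage comment are hypothetical, and you would need to choose a destination that is itself backed up.

```shell
# Archive a directory into a dated, compressed tarball in another location.
# Both arguments are supplied by the caller; nothing here is Nevis-specific.
backup_dir() {
    src="$1"     # directory to back up
    dest="$2"    # directory on a different disk that will hold the archive
    archive="$dest/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest" &&
        tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        echo "$archive"
}

# Hypothetical usage:
#   backup_dir /a/data/condor/array2/atlas/$USER /path/to/backed-up/area
```

The function prints the path of the archive it created, so the result can be checked afterwards with tar -tzf.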

