Difference: Condor (9 vs. 10)

Revision 10: 2010-03-06 - WilliamSeligman


Batch Services at Nevis

  The system responsible for administering batch services is condor.nevis.columbia.edu. Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
To use any of the condor commands given below, you have to set it up:

setup condor

Condor status and usage

The CondorView package has been set up so you can see how much of the batch cluster is in use, and by whom.

 

Fair use

The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work.
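One common way to keep individual jobs short is to split a long task into many independent pieces and let condor run them as a numbered batch. As an illustrative sketch (the executable and file names here are hypothetical, not part of the Nevis setup; "queue" is a standard condor submit-file command):

# myjob.cmd - submit 100 short jobs instead of one long one
universe   = vanilla
executable = myjob.sh
arguments  = $(Process)
output     = myjob-$(Process).out
error      = myjob-$(Process).err
log        = myjob.log
queue 100

Each job receives its own index (0 through 99) in $(Process), so the script can use it to pick out its share of the work.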

  If you use the vanilla environment (see below), as most users at Nevis must, for a job to be "pre-empted" means that it is suspended, and it cannot resume until that same machine has an available queue.
To get an idea of your user resource consumption and how it compares to other users, use these commands:
condor_userprio -allusers 
The larger the number, the lower your priority in comparison to the other users listed.
 

What processing power is available

The following commands will show you the machines available to run your jobs, their status, and their resources:
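For example (these are standard condor commands; the exact options used at Nevis may differ):

condor_status            # one line per job queue: state, activity, memory
condor_status -avail     # only machines willing to run jobs right now
condor_status -total     # summary totals only

The summary at the end of the condor_status output shows how many queues are Owner, Claimed, or Unclaimed at the moment.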

 

"Why isn't my job running on all the machines in the batch farm?"

There may be several reasons:

The heterogeneous cluster

 
Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler to compile your programs.
  You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:
# If you're using bash:
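The setup lines themselves depend on your shell; as a hypothetical sketch (the script path below is illustrative, not the actual Nevis path):

source /usr/nevis/adm/nevis-init.sh   # hypothetical path; use the actual Nevis setup script
setup condor                          # make the condor commands available in the job

If your jobs run under a different shell, source the corresponding setup file for that shell instead.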
  Finally, don't forget to set initialdir in your condor submit file.

The job requirements

There may be something explicit or implicit in the resources required to run your job. To pick an unrealistic example, if your job requires ksh and that shell isn't installed on a machine, then your job won't execute on that machine. A more practical example: if you have the following in your job submit file:

Requirements = ( Memory > 1024)
then your job won't execute on any machine whose memory per job queue is 1024 MB or less, including those machines with 1023 MB per queue due to rounding in the memory calculation.

If you think your job with ID 4402 should be able to execute on machine batch04, you can compare what condor thinks are the jobs requirements against what the machine offers:

condor_q -long -global 4402
condor_status -long batch04
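You can also ask condor directly which machines satisfy a given expression (this is the standard condor_status -constraint option):

condor_status -constraint 'Memory > 1024'

Any machine missing from this list fails the constraint, which usually points at the clause in your Requirements that is filtering it out.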

Suspended jobs

As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait.

 

Extra disk space

In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB.

 
initialdir = /a/data/tanya/seligman/kit/TestArea/ 

Important: Think about how you transfer your job's files

  Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once.
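One way to reduce that load is condor's own file-transfer mechanism, which copies a job's input files to the local disk of the execute machine and copies the output back when the job ends. A sketch of the relevant submit-file lines (the input file names are hypothetical; the commands themselves are standard condor submit-file syntax):

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = mydata.root, myconfig.txt

With this, the job reads and writes on the execute machine's local disk, and your server's disk is touched only at the start and end of each job instead of continuously.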
 