Difference: Condor (6 vs. 7)

Revision 7 2010-02-24 - WilliamSeligman

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="Trash.CosmicRay"

Batch Services at Nevis

>
>
META TOPICPARENT name="LinuxCluster"

Batch Services at Nevis

 
Changed:
<
<
This is a description of the batch job submission services available on the Linux cluster at Nevis Labs.
>
>
This is a description of the batch job submission services available on the Linux cluster at Nevis Labs.
 
Changed:
<
<

Batch and disk services

>
>

About the batch cluster

Batch manager

  The system responsible for administering batch services is condor.nevis.columbia.edu. Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
Added:
>
>

Fair use

The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work.

As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme for adjusting user priorities; here are the details.

The practical upshots of condor's default priority scheme:

  • If you use condor a lot, other users will tend to get higher priority when they submit jobs.
  • If your job takes more than an hour to run, there's a chance it will be pre-empted; that chance increases the longer the job runs.

In the vanilla environment (see below), which most users at Nevis must use, a "pre-empted" job is killed; it will be re-started from the beginning when a machine becomes available.

What processing power is available

The following commands will show you the machines available to run your jobs, their status, and their resources:

condor_status 
condor_status -server 

Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:

  • It's true, a machine that's 1/4 as fast will take 4 times as long to execute your jobs. However, the demand for the faster machine may be more than four times as much; it's possible that your job will sit waiting in the queue longer than it would have taken to run on the slower box.

  • The CPU cycles on the slower machines are presently being wasted. You might be able to put them to some use.

  • If you have a large number of jobs to submit, the slower machine can chug away at a couple of them while the rest are waiting to execute on the faster processors.

The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank attribute in your submit file:

Rank = Mips

With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:

Requirements = (Mips > 2000) 

This would restrict your job to the fastest processors on the cluster.
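
The two settings can also be combined. The fragment below is only a sketch: the executable name myjob.sh and the threshold of 1000 are hypothetical values chosen for illustration, while Rank, Requirements, and Mips are the attributes described above.

```
# Sketch of a submit file; myjob.sh and the Mips threshold are
# hypothetical values chosen for illustration.
universe     = vanilla
executable   = myjob.sh

# Hard limit: never run on machines at or below this Mips rating.
requirements = (Mips > 1000)

# Soft preference: among the machines that qualify, prefer the fastest.
rank         = Mips

queue
```

A looser Requirements line keeps the slower machines available to your jobs, while Rank still steers them toward the faster machines when those are free.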

All the machines on the batch farm are not the same

The batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools that can help solve this problem.

"Why isn't my job running on all the machines in the batch farm?"

You didn't read the previous section, did you?

Here it is again: Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler to compile your programs.

You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:

# If you're using bash:
shopt -s expand_aliases
source /usr/nevis/adm/nevis-init.sh
setup root geant4

Finally, don't forget to set initialdir in your condor submit file.

Extra disk space

 In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB.

The names of these RAID arrays are:

Line: 32 to 89
 

Submitting batch jobs

Changed:
<
<
The batch job submission system we're using at Nevis is Condor, developed at the University of Wisconsin. You can learn more about Condor from the User's Manual.
>
>
The batch job submission system we're using at Nevis is Condor, developed at the University of Wisconsin. You can learn more about Condor from the User's Manual.
  To use Condor at Nevis, the simplest way is to use the setup command:
Line: 40 to 97
  This will set the variable $CONDOR_CONFIG to ~condor/etc/condor_config, and add ~condor/bin to your $PATH.
Deleted:
<
<

What processing power is available

The following commands will show you the machines available to run your jobs, their status, and their resources:

condor_status 
condor_status -server 

Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:

  • It's true, a machine that's 1/4 as fast will take 4 times as long to execute your jobs. However, the demand for the faster machine may be more than four times as much; it's possible that your job will sit waiting in the queue longer than it would have taken to run on the slower box.

  • The CPU cycles on the slower machines are presently being wasted. You might be able to put them to some use.

  • If you have a large number of jobs to submit, the slower machine can chug away at a couple of them while the rest are waiting to execute on the faster processors.

The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank attribute in your submit file:

Rank = Mips

With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:

Requirements = (Mips > 2000) 

This would restrict your job to the fastest processors on the cluster.

 

Use the vanilla environment

We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful:

Changed:
<
<
Unless you've specifically used the condor_compile command to compile your programs, you must submit your jobs in the "vanilla" universe. In particular, the Athena and D0 distribution kits do not use condor_compile, and must have the following line in a command script that makes use of those kits:
>
>
Unless you've specifically used the condor_compile command to compile your programs, you must submit your jobs in the "vanilla" universe. In particular, the Athena and D0 distribution kits do not use condor_compile, and must have the following line in a command script that makes use of those kits:
 
universe = vanilla
Line: 79 to 113
 
initialdir = /a/data/tanya/seligman/kit/TestArea/ 
Added:
>
>

Important: Think about how you handle your job's files

Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once.

Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 200 files at once via NFS. It's happened several times at Nevis.

To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...

This will transfer your input files to the condor master server once, instead of 200 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the directory from which you submitted the job.

Result: you don't crash your server. You also don't clog up the network with unnecessary file transfers.
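
As a rough illustration, a complete command file built around those three lines might look like the sketch below. The executable and file names are hypothetical; the output, error, and log commands are standard condor submit commands that are not discussed above, so check them against the User's Manual before relying on them.

```
# Sketch only: myjob.sh, input.dat, and config.txt are hypothetical names.
universe                = vanilla
executable              = myjob.sh

# Let condor copy files to and from the execute machine,
# instead of having the job read and write over NFS.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat, config.txt

# Where the job's stdout and stderr, and condor's own event log, are written.
output                  = myjob.out
error                   = myjob.err
log                     = myjob.log

queue
```

With this arrangement the job itself should refer to its input files by bare name (for example input.dat), since condor places them in the job's working area on the execute machine.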

 

sh-style shells versus csh-style shells

There appears to be a difference in the way the sh and csh shells handle files in Condor. In sh, bash, or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh, the scripts will fail due to "file not found" errors unless you do one of the following:

Line: 91 to 142
 
executable = athena.csh
but the following succeeds:
executable = /a/data/tanya/seligman/kit/TestArea/athena.csh
Deleted:
<
<

All the machines on the batch farm are not the same

The batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora, nor do they all have the same version of gcc installed. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools that can help solve this problem.

"Why isn't my job running on all the machines in the batch farm?"

You didn't read the previous section, did you?

Here it is again: Not all machines in the farm are the same; they run different versions of Fedora. Make sure you use the standardized compiler to compile your programs.

You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain command such as:

source /usr/nevis/adm/nevis-init.sh
setup root geant4

Finally, don't forget to set initialdir in your condor submit file.

 

Examples

The standard condor examples

If you're just starting to learn Condor, a good way to start is to copy the Condor examples:

Changed:
<
<
cp -arv ~condor/condor-7.0.1/examples .
>
>
cp -arv ~condor/condor-7.2.4/examples .
 cd examples
Line: 123 to 157
 

Examples that incorporate the tips on this page

Changed:
<
<
Many of the above tips, and others, have been combined into a set of example scripts. They are in ~seligman/condor/; start with the README file, which will point you to the other relevant files in the directory.
>
>
Many of the above tips, and others, have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/; start with the README file, which will point you to the other relevant files in the directory. Note that those examples were prepared in 2005.
 

Submitting multiple jobs with one condor_submit command

An ATLAS example: Running Multiple Jobs On Condor

Deleted:
<
<
As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started.
 \ No newline at end of file
Added:
>
>
As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started.
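
For orientation, the basic idea of varying a parameter by process ID can be sketched as follows. This is not taken from the scripts in that directory; the executable and file names are hypothetical, and only the $(Process) macro and the queue command are standard condor features.

```
# Sketch only: myjob.sh and the output file names are hypothetical.
universe   = vanilla
executable = myjob.sh

# $(Process) expands to 0, 1, 2, ... 9, one value per queued job.
arguments  = $(Process)
output     = myjob-$(Process).out
error      = myjob-$(Process).err
log        = myjob.log

# Submit ten jobs with a single condor_submit command.
queue 10
```

Each job receives its own process ID as its first argument, so the script can use it to pick an input file, a random seed, or a run number.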
 