Difference: Condor (13 vs. 14)

Revision 14 2010-04-13 - WilliamSeligman

META TOPICPARENT name="LinuxCluster"

Batch Services at Nevis


Batch manager

The system responsible for administering batch services is condor.nevis.columbia.edu. Users typically do not log in to this machine directly; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
 

Condor status and usage

The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues and prevent other users from doing their work.
As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme for adjusting user priorities; see the Condor manual for details.
  The practical upshots of condor's default priority scheme:
  • If you use condor a lot, other users will tend to get higher priority when they submit jobs.
 
  • If you have a large number of jobs to submit, the slower machine can chug away at a couple of them while the rest are waiting to execute on the faster processors.
The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank attribute in your submit file:
 
Rank = Mips
With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:
 
Requirements = (Mips > 2000) 

This would restrict your job to the fastest processors on the cluster.
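Putting the two attributes together: Requirements filters out machines entirely, while Rank orders your job's preference among the machines that remain. A sketch of a submit-file fragment combining them (the executable name here is hypothetical):

```
# Sketch of a submit-file fragment; the script name is hypothetical.
universe     = vanilla
executable   = myjob.csh
# Only consider machines above this speed at all...
Requirements = (Mips > 2000)
# ...and among those, prefer the fastest.
Rank         = Mips
queue
```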

 condor_status -long batch04
Another clue can come from using condor_q. If you have a job held with an ID of 44.20:
 
condor_q -analyze 44.20
 

Submitting batch jobs

The batch job submission system we're using at Nevis is Condor, developed at the University of Wisconsin. You can learn more about Condor from the User's Manual.
 

Do you want 10,000 e-mails?

 

Use the vanilla environment

Unless you've specifically used the condor_compile command to compile your programs, you must submit your jobs in the "vanilla" universe. Any program that uses shared libraries cannot use condor_compile, and this includes most of the physics software at Nevis. Therefore, you are almost certainly required to have the following line in a command script:
 
universe = vanilla
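A minimal vanilla-universe submit file might look like the following sketch; the executable, output, error, and log file names are hypothetical:

```
# Minimal vanilla-universe submit file (file names are examples only).
universe   = vanilla
executable = myjob.csh
output     = myjob.out
error      = myjob.err
log        = myjob.log
queue
```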

Handling disk files

As you read through the User's Manual chapter on job submission, note that we use a shared file system at Nevis.
 
Because we use a shared file system at Nevis that's based on automount, it's a good idea to include the initialdir attribute in your command scripts. For example, when I submit a script that makes use of files in my directory /a/data/tanya/seligman/kit/TestArea/, I include the following line in my command script to make sure the executing machine has correctly mounted the directory:
 
initialdir = /a/data/tanya/seligman/kit/TestArea/ 
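With initialdir set, relative paths elsewhere in the submit file are resolved against that directory, so a fragment like this one (output, error, and log names are hypothetical) reads and writes under /a/data/tanya/seligman/kit/TestArea/:

```
# Sketch: relative file names below (hypothetical) resolve
# under the initialdir directory.
initialdir = /a/data/tanya/seligman/kit/TestArea/
output     = myjob.out
error      = myjob.err
log        = myjob.log
queue
```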
  Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 200 files at once via NFS. It's happened several times at Nevis.
To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:
 
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
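In context, a submit-file fragment that lets condor copy files to and from the execute machine, instead of having hundreds of jobs write over NFS at once, might look like this (the input file name is hypothetical):

```
# Sketch: let condor transfer files rather than writing via NFS.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = myinput.dat   # hypothetical input file
queue
```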
 but the following succeeds:
executable = /a/data/tanya/seligman/kit/TestArea/athena.csh

condor log files

If you want to see the condor daemons' log files for a machine with the name hostname, look in /a/data/hostname/condor/log. For example, to find out the "real" name of the current condor master server:

# host condor.nevis.columbia.edu
condor.nevis.columbia.edu is an alias for karthur.nevis.columbia.edu.
Then you can look at its log files:
# ls -lrth /a/data/karthur/condor/log
-rw-r--r-- 1 condor condor  153 2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root   root   591K 2010-04-13 16:29 MasterLog
-rw-r--r-- 1 root   root   788K 2010-04-13 17:15 StartLog
-rw-r--r-- 1 root   root   562K 2010-04-13 17:25 NegotiatorLog
-rw-r--r-- 1 root   root   296K 2010-04-13 17:25 CollectorLog
 

Examples

The standard condor examples

If you're just starting to learn Condor, a good way to start is to copy the Condor examples:

cp -arv /usr/share/doc/condor-*/examples .
 cd examples
 