Difference: Condor (21 vs. 22)

Revision 22 2010-12-16 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="LinuxCluster"

Batch Services at Nevis

Line: 12 to 12
  You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? Where will the files go on output?
Changed:
<
<
Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy.
>
>
Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy.
 

Getting started

Line: 39 to 39
  As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started.
Added:
>
>

Resource Management

Important: Think about how you transfer your job's files

Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via an automount path like /a/data/tanya; this means you're using NFS. Let's say there are 300 batch queues in the Nevis condor system. That means that 300 jobs are trying to access the disk on your server at once.

Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 300 files at once via NFS. It's happened several times at Nevis.

To partially solve this problem, the condor batch nodes are blocked from writing to the /home and /data partitions on the servers.

To work within this restriction, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...
initialdir = ...where your inputs and outputs are located...

This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir directory.

Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir (see below).
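Putting these commands together, a complete command file might look something like the sketch below. This is only an illustration: the executable name, the input file names, and the mySimulation directory are made up, and the <server> and <username> placeholders have to be replaced with your own values.

```
# A minimal sketch of a condor command file that uses file transfer.
# All file and directory names here are hypothetical placeholders.
universe                = vanilla

# A full path avoids any doubt about where condor looks for the executable.
executable              = /a/data/<server>/<username>/mySimulation/mySimulation.sh

# Copy the inputs to the executing machine, and copy the outputs back on exit.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Relative names are looked up in initialdir.
transfer_input_files    = input.dat, config.txt
initialdir              = /a/data/<server>/<username>/mySimulation

queue
```

With a file like this, the job itself only touches the local disk of the machine on which it runs; your server sees one transfer of the inputs at the start and one transfer of the outputs at the end.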

Memory limits

The systems on the condor batch cluster have enough RAM for 1GB per processing queue. This means that if your job uses more than 1GB of memory, there can be a problem. For example, if your job requires 2GB of memory and a condor batch node has 16 queues, then 16 copies of your job will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously and essentially halt.

To keep this from happening, condor will automatically cancel a job that requires more than 1GB of RAM. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, condor tends to overestimate; if a program uses shared libraries, condor tends to underestimate.

Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use.

Submitting batch jobs

Do you want 10,000 e-mails?

By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable.

Please place the following line in your condor submit file:

Notification    = Error

This means that condor will only send you an e-mail if there's an error while running the job.

Do you want to use up all your disk space?

At the end of most condor batch files, you'll see lines that look like this:

output   = mySimulation-$(Process).out
error    = mySimulation-$(Process).err
log      = mySimulation-$(Process).log

These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected.

The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition.

You can do this by:

  • submitting your job from a directory on the /data partition;
  • using the initialdir command to tell condor where inputs and outputs are located.

Don't forget to create a directory with a name like /a/data/<server>/<username>/ before you submit your first job.
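As a rough sketch of the second approach, the lines at the end of a command file could look like the following. The mySimulation names are just examples, the job count of 100 is arbitrary, and <server> and <username> are placeholders for your own values.

```
# Sketch: keep the per-job output files off your home directory.
# The directory named in initialdir must already exist before you submit.
initialdir   = /a/data/<server>/<username>/

# These relative names are written under initialdir, not under the
# directory from which you run condor_submit.
output       = mySimulation-$(Process).out
error        = mySimulation-$(Process).err
log          = mySimulation-$(Process).log

# Only send e-mail if something goes wrong.
notification = Error

# 100 jobs means 300 files in initialdir.
queue 100
```

Because the file names contain $(Process), each job gets its own set of three files, so it's still worth cleaning out the directory between large submissions.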

Use the vanilla environment

Unless you've specifically used the condor_compile command to compile your programs, you must submit your jobs in the "vanilla" universe. Any program that uses shared libraries cannot use condor_compile, and this includes most of the physics software at Nevis. Therefore, you are almost certainly required to have the following line in a command script:

universe = vanilla

condor log files

If you want to see the condor daemons' log files for a machine with the name hostname, look in /a/data/<hostname>/condor/log. For example:

# ls -blrth /a/data/karthur/condor/log
-rw-r--r-- 1 condor condor  153 2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root   root   591K 2010-04-13 16:29 MasterLog
-rw-r--r-- 1 root   root   788K 2010-04-13 17:15 StartLog
-rw-r--r-- 1 root   root   562K 2010-04-13 17:25 NegotiatorLog
-rw-r--r-- 1 root   root   296K 2010-04-13 17:25 CollectorLog
 

About the batch cluster

Batch manager

Line: 154 to 235
 The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5 arrays, which are a reliable form of storage; there is monitoring software that warns if any individual drives have failed. However, RAID arrays have been known to fail (and we've had at least one such failure at Nevis). If you have any critical data stored on these drives, make sure you back up the files yourself.

One more time: the disks on these partitions are not backed up!

Deleted:
<
<

Submitting batch jobs

Do you want 10,000 e-mails?

By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable.

Please place the following line in your condor submit file:

Notification    = Error

This means that condor will only send you an e-mail if there's an error while running the job.

Do you want to use up all your disk space?

At the end of most condor batch files, you'll see lines that look like this:

output   = mySimulation-$(Process).out
error    = mySimulation-$(Process).err
log      = mySimulation-$(Process).log

These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected.

The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition.

You can do this by:

  • submitting your job from a directory on the /data partition;
  • explicitly writing your output files to a directory on the /data partition; e.g.,
output   =  /a/data/<server>/<username>/mySimulation--$(Process).out
error    =  /a/data/<server>/<username>/mySimulation--$(Process).err
log      =  /a/data/<server>/<username>/mySimulation-$(Process).log

Don't forget to create /a/data/<server>/<username>/ before you submit your first job.

Use the vanilla environment

Unless you've specifically used the condor_compile command to compile your programs, you must submit your jobs in the "vanilla" universe. Any program that uses shared libraries cannot use condor_compile, and this includes most of the physics software at Nevis. Therefore, you are almost certainly required to have the following line in a command script:

universe = vanilla

Handling disk files

As you read through the User's Manual chapter on job submission, note that we use a shared file system at Nevis.

Because we use a shared file system at Nevis that's based on automount, it's a good idea to include the initialdir attribute in your command scripts. For example, when I submit a script that makes use of files in my directory /a/data/tanya/seligman/kit/TestArea/, I include the following line in my command script to make sure the executing machine has correctly mounted the directory:

initialdir = /a/data/tanya/seligman/kit/TestArea/ 

Important: Think about how you transfer your job's files

Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once.

Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 200 files at once via NFS. It's happened several times at Nevis.

To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...

This will transfer your input files to the condor master server once, instead of 200 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the directory from which you submitted the job.

Result: you don't crash your server. You also don't clog up the network with unnecessary file transfers.

sh-style shells versus csh-style shells

We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful:

There appears to be a difference in the way the sh and csh shells handle files in Condor. In sh, bash, or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh, the scripts will fail due to "file not found" errors unless you do one of the following:

  • Transfer all working files to the executing machine with the lines:
should_transfer_files = YES 
when_to_transfer_output = ON_EXIT 

  • Include the full path name when you reference any file in your command script. For example, in my scripts, the following line fails:
executable = athena.csh
but the following succeeds:
executable = /a/data/tanya/seligman/kit/TestArea/athena.csh

condor log files

If you want to see the condor daemons' log files for a machine with the name hostname, look in /a/data/<hostname>/condor/log. For example:

# ls -blrth /a/data/karthur/condor/log
-rw-r--r-- 1 condor condor  153 2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root   root   591K 2010-04-13 16:29 MasterLog
-rw-r--r-- 1 root   root   788K 2010-04-13 17:15 StartLog
-rw-r--r-- 1 root   root   562K 2010-04-13 17:25 NegotiatorLog
-rw-r--r-- 1 root   root   296K 2010-04-13 17:25 CollectorLog
 \ No newline at end of file
 