Batch Services at Nevis

This is a description of the batch job submission services available on the Linux cluster at Nevis Labs.

Stop. Read this First.

You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? How will the output files be transferred?

Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy.

Getting started

The batch job submission system we're using at Nevis is Condor, developed at the University of Wisconsin. You can learn more about Condor from the User's Manual.

The standard condor examples

If you're just starting to learn Condor, a good way to start is to copy the Condor examples:

cp -arv /usr/share/doc/condor-*/examples .
cd examples 

Read the README file; type make to compile the programs; type sh submit to submit a few test jobs.

You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.

Examples that incorporate the tips on this page

Many of the following tips have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/; start with the README file, which will point you to the other relevant files in the directory. Note that those examples were prepared in 2005.

Submitting multiple jobs with one condor_submit command

An ATLAS example: Running Multiple Jobs On Condor

As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started.
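As a rough sketch of the idea (this is not a copy of the scripts in that directory), the following shell snippet writes a small command file that queues five jobs and passes each one its condor process ID. The names sweep.cmd and runSweep.sh are hypothetical placeholders; $(Process) and queue are standard condor submit commands.

```shell
# Write a hypothetical command file, sweep.cmd, that queues 5 jobs.
# runSweep.sh is a placeholder for your own script; each job receives
# its process ID (0, 1, 2, 3, or 4) as the script's first argument.
# The heredoc delimiter is quoted so the shell leaves $(Process) alone.
cat > sweep.cmd <<'EOF'
universe   = vanilla
executable = runSweep.sh
arguments  = $(Process)
output     = sweep-$(Process).out
error      = sweep-$(Process).err
log        = sweep-$(Process).log
queue 5
EOF

# Submit it with: condor_submit sweep.cmd
```

Inside runSweep.sh, the process ID arrives as $1; your script can use it to pick a random-number seed, an input file, or a parameter value.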

Resource Management

Important: Think about how you transfer your job's files

Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via an automount path like /a/data/tanya; this means you're using NFS. Let's say there are 300 batch queues in the Nevis condor system. That means that 300 jobs are trying to access the disk on your server at once.

Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 300 files at once via NFS. It's happened several times at Nevis.

To partially solve this problem, the condor batch nodes are blocked from writing to the /home and /data partitions on the servers.

To work within this restriction, don't rely on your job reading and writing files directly in a particular directory. Instead, use commands like the following in your condor command file; look them up in the User's Manual:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...
initialdir = ...where your inputs and outputs are located...

This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir directory.

Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir (see below).
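Putting those commands together, a command file might look something like the sketch below. Everything specific in it is an assumption made for illustration: the file name transfer.cmd, the script mySimulation.sh, the two input files, and the jsmith directory are placeholders, not real files at Nevis.

```shell
# Write a hypothetical command file, transfer.cmd. The script name, the
# input file names, and the jsmith directory are placeholders to replace.
# The heredoc delimiter is quoted so the shell leaves $(Process) alone.
cat > transfer.cmd <<'EOF'
universe                = vanilla
executable              = mySimulation.sh
initialdir              = /a/data/tanya/jsmith
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = geometry.dat, settings.txt
output                  = mySimulation-$(Process).out
error                   = mySimulation-$(Process).err
log                     = mySimulation-$(Process).log
queue 10
EOF

# Submit it with: condor_submit transfer.cmd
```

Because the input files are listed without full pathnames, condor looks for them in initialdir, and the output files come back to the same directory when each job exits.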

There's one more thing to think about: If initialdir is located on machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:

initialdir = /a/data/kolya/jsmith
queue 10000

and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow for more information. This has not yet caused any machines at Nevis to crash, but it has caused both machines to become annoyingly slow.

Memory limits

The systems on the condor batch cluster have enough RAM for 1GB per processing queue. This means that if your job uses more than 1GB of memory, there can be a problem. For example, if your job requires 2GB of memory and a condor batch node has 16 queues, then 16 copies of your job will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt.
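If you'd like to try the same arithmetic with your own numbers, here is a small shell sketch. The three values come from the example above; the variable names are just for illustration.

```shell
# Values taken from the example above; adjust them for your own job.
queues_per_node=16     # job queues on one batch node
ram_per_queue_gb=1     # RAM available per queue, in GB
job_memory_gb=2        # memory each of your jobs uses, in GB

node_ram_gb=$(( queues_per_node * ram_per_queue_gb ))
needed_gb=$(( queues_per_node * job_memory_gb ))

echo "Node RAM: ${node_ram_gb} GB; your jobs need: ${needed_gb} GB"
if [ "$needed_gb" -gt "$node_ram_gb" ]; then
    echo "Oversubscribed: the node would start swapping."
fi
```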

To keep this from happening, condor will automatically cancel a job that requires more than 1GB of RAM. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, condor tends to overestimate; if a program uses shared libraries, condor tends to underestimate.

Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use.

Submitting batch jobs

Do you want 10,000 e-mails?

By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable.

Please place the following line in your condor submit file:

Notification    = Error

This means that condor will only send you an e-mail if there's an error while running the job.

Do you want to use up all your disk space?

At the end of most condor batch files, you'll see lines that look like this:

output   = mySimulation-$(Process).out
error    = mySimulation-$(Process).err
log      = mySimulation-$(Process).log

These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir points to a sub-directory of your home directory, sooner or later you'll fill up the /home partition on your server. Everyone in your group will be affected.

The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition which is several TB in size. It's a good idea to make sure your output files are written to this partition.

You can do this via one of the following:

  • submit your job from a directory on the /data partition; e.g., /a/data/<server>/<username>/
  • use the initialdir command to tell condor where inputs and outputs are located.
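Here is one way the second option might look, combined with the e-mail setting from the previous section. This is only a sketch: the server name tanya, the fallback user name jsmith, and the mySimulation names are placeholders, and the directory should already exist before you submit.

```shell
# Hypothetical values: replace "tanya" with your own workgroup server, and
# "mySimulation" with whatever you call your job.
server=tanya
outdir="/a/data/${server}/${USER:-jsmith}/mySimulation"

# The heredoc is unquoted so that ${outdir} is expanded by the shell;
# \$(Process) is escaped so that condor, not the shell, fills it in.
cat > mySimulation.cmd <<EOF
universe     = vanilla
executable   = mySimulation.sh
initialdir   = ${outdir}
notification = Error
output       = mySimulation-\$(Process).out
error        = mySimulation-\$(Process).err
log          = mySimulation-\$(Process).log
queue 100
EOF

echo "Job files will be written under ${outdir}"
```

With queue 100, this creates 300 output, error, and log files, all of them on the /data partition instead of /home.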

Use the vanilla environment

Unless you've specifically used the condor_compile command to compile your programs, you must submit your jobs in the "vanilla" universe. Any program that uses shared libraries cannot use condor_compile, and this includes most of the physics software at Nevis. Therefore, you are almost certainly required to have the following line in a command script:

universe = vanilla

condor log files

If you want to see the condor daemons' log files for a machine with the name hostname, look in /a/data/<hostname>/condor/log. For example:

# ls -blrth /a/data/karthur/condor/log
-rw-r--r-- 1 condor condor  153 2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root   root   591K 2010-04-13 16:29 MasterLog
-rw-r--r-- 1 root   root   788K 2010-04-13 17:15 StartLog
-rw-r--r-- 1 root   root   562K 2010-04-13 17:25 NegotiatorLog
-rw-r--r-- 1 root   root   296K 2010-04-13 17:25 CollectorLog

About the batch cluster

Batch manager

The system responsible for administering batch services is condor.nevis.columbia.edu. Users typically do not log in to this machine directly; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.

Condor status and usage

You can see how much of the batch cluster is in use, and by whom:

Fair use

The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work.

As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme for adjusting user priorities; here are the details.

The practical upshots of condor's default priority scheme:

  • If you use condor a lot, other users will tend to get higher priority when they submit jobs.
  • If your job takes more than an hour to run, there's a chance it will be pre-empted; that chance increases the longer the job runs.

If you use the vanilla environment (see above), as most users at Nevis must, for a job to be "pre-empted" means that it is killed, and must start again from the beginning.

To get an idea of your user resource consumption and how it compares to other users, use this command:

condor_userprio -allusers

The larger the number, the lower your priority in comparison to the other users listed.

What processing power is available

The following commands will show you the machines available to run your jobs, their status, and their resources:

condor_status 
condor_status -server 

Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:

  • It's true, a machine that's 1/4 as fast will take 4 times as long to execute your jobs. However, the demand for the faster machine may be more than four times as much; it's possible that your job will sit waiting in the queue longer than it would have taken to run on the slower box.

  • The CPU cycles on the slower machines are presently being wasted. You might be able to put them to some use.

  • If you have a large number of jobs to submit, the slower machine can chug away at a couple of them while the rest are waiting to execute on the faster processors.

The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank attribute in your submit file:

Rank = Mips

With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:

Requirements = (Mips > 2000) 

This would restrict your job to the fastest processors on the cluster.

Not all the machines on the batch farm are the same

The batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools that can help solve this problem.

"Why isn't my job running on all the machines in the batch farm?"

There may be several reasons:

The heterogeneous cluster

Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler to compile your programs.

You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the examples described above, you'll see that the shell scripts all contain commands such as:

# If you're using bash:
shopt -s expand_aliases
source /usr/nevis/adm/nevis-init.sh
setup root geant4

Finally, don't forget to set initialdir in your condor submit file.

The job requirements

There may be something explicit or implicit in the resources required to run your job. To pick an unrealistic example, if your job requires ksh and that shell isn't installed on a machine, then the job won't execute on that machine. A more practical example: If you have the following in your job submit file:

Requirements = ( Memory > 1024 )

then your job won't execute if the amount of memory per job queue is 1024 or less, including those machines with 1023 MB per queue due to rounding in the memory calculation.
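The comparison is strict, which is easy to overlook. The following shell sketch mimics the test with a few made-up memory values; memory_matches is a hypothetical helper written for this example, not a condor command.

```shell
# Succeeds if a queue with $1 MB of memory satisfies "Memory > $2".
memory_matches() {
    [ "$1" -gt "$2" ]
}

# Memory values (in MB per queue) are made up for illustration.
for mem in 1023 1024 2048; do
    if memory_matches "$mem" 1024; then
        echo "${mem} MB per queue: matches Memory > 1024"
    else
        echo "${mem} MB per queue: rejected by Memory > 1024"
    fi
done
```

Only the 2048 MB queue passes; a machine that nominally offers 1GB per queue is rejected whether condor reports it as 1024 or 1023.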

If you think your job with ID 4402 should be able to execute on machine batch04, you can compare what condor thinks are the job's requirements against what the machine offers:

condor_q -long -global 4402
condor_status -long batch04

Another clue can come from using condor_q. If you have a job held with an ID of 44.20:

condor_q -analyze 44.20

Suspended jobs

As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait.

Extra disk space

In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB.

The names of these RAID arrays are:

  • /a/data/condor/array1/
  • /a/data/condor/array2/

For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group):

cd /a/data/condor/array2/atlas/
mkdir $USER
cd $USER # ... create whatever files you want 

Important! If you're skimming this page, stop and read the following paragraph!

The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5 arrays, which are a reliable form of storage; there is monitoring software that warns if any individual drives have failed. However, RAID arrays have been known to fail (and we've had at least one such failure at Nevis). If you have any critical data stored on these drives, make sure you back up the files yourself.

One more time: the disks on these partitions are not backed up!


Topic revision: r26 - 2011-07-05 - WilliamSeligman
 