Condor Basics
This page describes some of the basics for setting up jobs on the particle-physics computer cluster at Nevis.
Warning: Using condor is not trivial. You'll have to learn quite a few details about the batch system and about disk sharing. What follows are a few basic concepts, but by itself this page is not enough to get you started.
Documentation
Condor was developed at the University of Wisconsin. Here is the User's Manual.
Steps
There are usually three steps to developing a program to submit to condor.
Program
This is the code that you want condor to execute. I assume you've developed the program interactively, but you now want to automate its execution for condor. You probably don't have to change the program itself, but you may have to move the executable and any libraries to a disk that's visible to the condor batch system.
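For example, the copy might look something like this; the shared-disk path is just a placeholder, since the actual location depends on your group's setup (see the disk sharing page):
  # Copy the executable and any private libraries it needs to a disk the
  # batch nodes can see. "/a/data/myserver/myusername" is a made-up path;
  # use whatever shared area your group actually uses.
  cp myAnalysis /a/data/myserver/myusername/condor/
  cp libMyAnalysis.so /a/data/myserver/myusername/condor/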
Script
In theory, you can run a program directly without a script; most of the examples in /usr/share/doc/condor-*/examples do this.
In practice, programs in physics typically need scripts to organize a program's execution environment. A shell script would invoke the "setup" commands or scripts needed to run the program; if you have to type source my-experiment-setup-sh before running your program, you'd put that command in a shell script.
Don't forget to make the shell script executable; e.g., chmod +x myscript.sh.
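A minimal sketch of such a wrapper script (the setup-script and program names are placeholders for your own):
  #!/bin/bash
  # myscript.sh: set up the environment, then run the program.
  # "my-experiment-setup-sh" and "myAnalysis" stand in for your
  # experiment's setup script and your executable.
  source my-experiment-setup-sh
  ./myAnalysis "$@"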
Scripts can become "mini-programs" themselves. For example, in a simulation script I wrote, the shell script determined the particle ID to be input into the Monte Carlo, the energy of that particle, and the random-number seeds, and constructed a unique name for the output file.
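Here's a hedged sketch of that idea, assuming the command file passes the job's process number as the script's first argument; the particle IDs, energies, and program name are made up:
  #!/bin/bash
  # "Mini-program" wrapper: derive the job's parameters from the process
  # number given as the first argument.
  process=$1
  particles=( 11 13 211 )                  # e-, mu-, pi+
  pid=${particles[$(( process % 3 ))]}
  energy=$(( 10 + 10 * ( process / 3 ) ))  # in GeV
  seed=$(( 12345 + process ))
  output=sim_pid${pid}_E${energy}_${process}.root
  ./mySimulation --pid $pid --energy $energy --seed $seed --output $output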
Command file
Condor requires that jobs be submitted via a condor command file; e.g., condor_submit mycommands.cmd. This command file tells condor the script to execute, what files to copy, and where to put the program's output. The command file also tells condor how many copies of the program to run; that's how you submit 1000 jobs with a single command.
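A minimal mycommands.cmd might look something like this (the file names are placeholders; see the batch details page before using it for real):
  # mycommands.cmd -- a bare-bones condor command file
  universe   = vanilla
  executable = myscript.sh
  # Uncomment to have condor copy input files to the execute machine:
  # transfer_input_files = myAnalysis
  output     = myjob.out
  error      = myjob.err
  log        = myjob.log
  queue
You then submit it with condor_submit mycommands.cmd; changing the last line to queue 1000 would submit 1000 copies.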
Batch Clusters
There is more than one particle-physics batch cluster at Nevis, due to the different analysis requirements of some groups.
The cluster which executes a job is determined by the machine on which you issue the condor_submit command. For example, if you submit a job from a Neutrino system, it will run on the Neutrino cluster; if you submit a job from kolya or karthur, it runs on the general cluster; if you submit a job from xenia, it runs on the ATLAS cluster.
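To see which machines make up the pool you're attached to (and therefore where your jobs would run), you can run condor_status on the machine from which you plan to submit:
  # Lists the slots/machines in the condor pool this submit machine belongs to.
  condor_status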
Where to learn
Let's start with the obvious: If someone else in your group has a set of condor scripts that work at Nevis, copy them! If you have to write your own:
The standard condor examples
One good way to start is to copy the Condor examples:
cp -arv /usr/share/doc/condor-*/examples .
cd examples
Read the README file; type make to compile the programs; type sh submit to submit a few test jobs.
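Once the test jobs are in the queue, the standard condor commands let you keep an eye on them (the job ID below is made up):
  condor_q           # list your jobs; I = idle, R = running
  condor_rm 1234.0   # remove a job by the ID that condor_q shows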
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe; see batch details.
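In the command file, that means a line like this:
  universe = vanilla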
Other programs in the examples may not work either. Look at the output, error, and log files, and search the web for any error messages. This will give you practice for when your "real" jobs begin to fail.
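When a job just sits in the queue, you can also ask condor itself for the reason (again, the job ID is made up):
  # Explain why job 1234.0 hasn't matched a machine and started running.
  condor_q -analyze 1234.0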
Some practical examples
Many of the details have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/; start with the README file, which will point you to the other relevant files in the directory. Note that these examples were prepared in 2005, before we figured out how to do disk sharing properly.
Submitting multiple jobs with one condor_submit command
An ATLAS example: Running Multiple Jobs On Condor
As of Jun-2008, you can find several examples of multiple job submission in /a/home/houston/seligman/nusong/aria/work; these go further than the tips in the above link, generating both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. Again, these examples were written before we figured out how to do disk sharing.
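The general idea is sketched below (the names are placeholders, not the actual scripts in that directory): the command file hands $(Process) to the script and uses it to give each job its own output and log files:
  # multi.cmd -- submit many copies of one script with a single condor_submit
  universe   = vanilla
  executable = myscript.sh
  arguments  = $(Process)
  output     = myjob-$(Process).out
  error      = myjob-$(Process).err
  log        = myjob-$(Process).log
  queue 1000
The script then reads the process number as its first argument (as in the sketch under "Script" above) and turns it into particle IDs, energies, output file names, and so on.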
What's next
Now look at the pages on disk sharing and batch details. This will help you create scripts and command files that work in the current Nevis environment. Once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy.
Good luck!