Condor Basics

This page describes some of the basics for setting up jobs on the particle-physics computer cluster at Nevis.

For a quick introduction to condor at Nevis, see the documents on batch processing and condor tutorial that are on WilliamSeligman's ROOT tutorial page.

Warning: Using condor is not trivial. You'll have to learn quite a few details, particularly about disk sharing. What follows are a few basic concepts, but they are not enough on their own to get you started.

Documentation

Condor was developed at the University of Wisconsin. Here is the User's Manual.

Steps

There are usually three steps to developing a program to submit to condor.

Program

This is the code that you want condor to execute. I assume you've developed the program interactively, but you now want to automate its execution for condor. You probably don't have to change the program itself, but you may have to move the executable and any libraries to a disk that's visible to the condor batch system.

Script

In theory, you can run a program directly without a script; most of the examples in /usr/share/doc/condor-*/examples do this.

In practice, programs in physics typically need scripts to organize a program's execution environment. A shell script would invoke the module load commands or scripts needed to run the program; if you have to type source my-experiment-setup-sh before running your program, you'd put that command in a shell script.

Don't forget to make the shell script executable; e.g., chmod +x myscript.sh.
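As a rough illustration of such a wrapper, the sketch below writes a small script and makes it executable. The names myscript.sh, my-experiment-setup-sh, and myprogram are placeholders taken from or invented for this page, not files that exist at Nevis; substitute whatever your experiment actually requires.

```shell
# Write a hypothetical wrapper script; every file name here is a placeholder.
cat > myscript.sh <<'EOF'
#!/bin/bash

# Set up the environment, as you would by hand in an interactive session.
source ./my-experiment-setup-sh

# Run the program, passing along any arguments given to this script.
./myprogram "$@"
EOF

# Make the script executable so the batch system can run it.
chmod +x myscript.sh
```

The script does by itself what you would otherwise type interactively, so it is worth testing it from the command line before handing it to condor.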

Scripts can become "mini-programs" themselves. For example, in a simulation script I wrote, the shell script determined the particle ID to be input into the Monte Carlo, the energy of that particle, the random number seeds, and a unique name for the output file.
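The simulation script itself is not reproduced on this page. The following is only a minimal sketch of the same idea, assuming the job number arrives as the script's first argument; the function name, the particle list, the energy steps, and the file-name pattern are all invented for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: derive simulation parameters from a job number.

set_job_parameters() {
    process=$1

    # Cycle through three particle IDs (PDG codes for e-, mu-, pi+).
    case $(( process % 3 )) in
        0) particle=11 ;;
        1) particle=13 ;;
        2) particle=211 ;;
    esac

    # Step the energy through 10, 20, 30, 40, 50 GeV, then repeat.
    energy=$(( 10 * (process % 5 + 1) ))

    # Give every job a distinct random-number seed.
    seed=$(( 1000 + process ))

    # Include the job number in the name so every output file is unique.
    output="sim_pid${particle}_${energy}GeV_job${process}.root"
}

# The job number is assumed to arrive as the first argument (default 0).
set_job_parameters "${1:-0}"
echo "particle=${particle} energy=${energy} seed=${seed} output=${output}"
```

Because everything is computed from a single number, the same script can serve every job in a submission without being edited.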

Command file

Condor requires that jobs be submitted via a condor command file; e.g., condor_submit mycommands.cmd. This command file tells condor the script to execute, what files to copy, and where to put the program's output. The command file also tells condor how many copies of the program to run; that's how you submit 1000 jobs with a single command.
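To make this concrete, the sketch below writes a generic command file from the shell. The submit commands shown are standard HTCondor ones, but the file names (myscript.sh, myprogram, job-*.out, job.log) are placeholders, and the file-transfer settings are an assumption that may not match the current Nevis setup; the pages on disk sharing and batch details take precedence.

```shell
# Write a hypothetical command file; the quoted 'EOF' keeps $(Process) literal.
cat > mycommands.cmd <<'EOF'
# Run the job in the vanilla universe.
universe   = vanilla

# The script to execute; each job receives its own process number (0, 1, 2, ...).
executable = myscript.sh
arguments  = $(Process)

# Where to put the program's output; one set of files per job.
output     = job-$(Process).out
error      = job-$(Process).err
log        = job.log

# What files to copy to the machine that runs the job.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = myprogram

# How many copies of the job to run.
queue 1000
EOF

# Submit it (commented out so this sketch runs on a machine without condor):
# condor_submit mycommands.cmd
```

The $(Process) macro is what distinguishes one copy from another; without it, all 1000 jobs would write to the same output files.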

Batch Clusters

There is more than one particle-physics batch cluster at Nevis, because some groups have different analysis requirements.

The cluster which executes a job is determined by the machine on which you issue the condor_submit command. For example, if you submit a job from a Neutrino system, it will run on the Neutrino cluster; if you submit a job from kolya or karthur, it runs on the general cluster; if you submit a job from xenia, it runs on the ATLAS cluster.

Where to learn

Let's start with the obvious: If someone else in your group has a set of condor scripts that work at Nevis, copy them! If you have to write your own:

The standard condor examples

One way to start is to copy the Condor examples:

cp -arv /usr/share/doc/condor-*/examples .
cd examples 

Read the README file; type make to compile the programs; type sh submit to submit a few test jobs. Note that these examples are several years old, and you may have to do some debugging to get the compilation process to work.

You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe; see batch details.

Other programs in the examples may not work either. Look at the output, error, and log files; search the web for any error messages. This will provide experience when your "real" jobs begin to fail.
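A quick way to triage many jobs at once is to let the shell find the files worth reading. The sketch below assumes the error files are named job-N.err and the log file is job.log; those names are hypothetical, so use whatever your command file specifies.

```shell
#!/bin/bash
# Hypothetical sketch: report which jobs left evidence of trouble.

check_jobs() {
    # A non-empty error file means the job wrote something to stderr.
    for f in job-*.err; do
        if [ -s "$f" ]; then
            echo "non-empty error file: $f"
        fi
    done

    # Log entries mentioning "held" or "aborted" point to jobs condor stopped.
    if [ -f job.log ]; then
        grep -i -E 'held|aborted' job.log || echo "no held or aborted jobs in job.log"
    fi
}

check_jobs
```

This only narrows down where to look; the text of the error message is still what you search the web for.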

Some practical examples

Many of the details have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/; start with the README file, which will point you to the other relevant files in the directory. Note that these examples were prepared in 2005, before we figured out how to do disk sharing properly.

Submitting multiple jobs with one condor_submit command

An ATLAS example: Running Multiple Jobs On Condor

As of Jun-2008, you can find several examples of multiple job submission in /a/home/houston/seligman/nusong/aria/work; these build on the tips in the above link to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. Again, these examples were written before we figured out how to do disk sharing.
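The scripts in that directory are not reproduced here. As one possible approach to text parameters, the sketch below looks up a line in a list file using the process ID, and derives a numeric parameter from the same ID. The file samples.txt, its contents, and the function name are invented for illustration.

```shell
#!/bin/bash
# Hypothetical list of text parameters, one per line.
cat > samples.txt <<'EOF'
electron
muon
pion
EOF

# Print the entry for a given process ID (IDs start at 0, lines at 1).
sample_for_process() {
    sed -n "$(( $1 + 1 ))p" samples.txt
}

# The process ID is assumed to arrive as the first argument (default 0).
process=${1:-0}
sample=$(sample_for_process "$process")

# A numeric parameter derived from the same ID.
energy=$(( 5 * (process + 1) ))

echo "job ${process}: sample=${sample} energy=${energy}"
```

With this layout, the number of jobs to queue is simply the number of lines in the list file.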

What's next

Now look at the pages on disk sharing and batch details. This will help you create scripts and command files that work in the current Nevis environment. Once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy.

Good luck!

Topic revision: r34 - 2018-05-25 - WilliamSeligman
 