Difference: Condor (26 vs. 27)

Revision 27, 2011-07-06 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="LinuxCluster"

Batch Services at Nevis

Line: 6 to 6
 
Changed:
<
<
This is a description of the batch job submission services available on the Linux cluster at Nevis Labs.
>
>
Stop. Read the page on disk sharing.
 
Changed:
<
<
Stop. Read this First.

You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? How will the output files be transferred?

Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy.

>
>
Yeah, it's a pain: you have to read both this page and a whole separate page on disk management. But once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy.
 

Getting started

Line: 39 to 35
  As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started.
Changed:
<
<

Resource Management

Important: Think about how you transfer your job's files

Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via an automount path like /a/data/tanya; this means you're using NFS. Let's say there are 300 batch queues in the Nevis condor system. That means that 300 jobs are trying to access the disk on your server at once.

Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 300 files at once via NFS. It's happened several times at Nevis.

To partially solve this problem, the condor batch nodes are blocked from writing to the /home and /data partitions on the servers.

In order to get around this, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...
initialdir = ...where your inputs and outputs are located...

This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir directory.

Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir (see below).
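
Put together, the commands above might look like the following in a complete submit file. This is only a sketch: the executable name and the input file names are hypothetical placeholders, and the server and user name in the initialdir path have to be filled in for your own group.

```
# Sketch only: mySimulation.sh, mySimulation.cfg and mySimulation.dat
# are hypothetical names, not files that exist at Nevis.
universe                = vanilla
executable              = mySimulation.sh

# Let condor move the files instead of having each job use NFS directly.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Input files that condor transfers for the job.
transfer_input_files    = mySimulation.cfg, mySimulation.dat

# Relative file names are resolved against this directory.
initialdir              = /a/data/<server>/<username>/

queue
```

Comments go on their own lines, starting with #; condor submit files do not allow a comment at the end of a command line.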

There's one more thing to think about: If initialdir is located on machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:

initialdir = /a/data/kolya/jsmith
queue 10000
and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates, karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow for more information. This has not yet caused any machines at Nevis to crash, but it has caused both machines to become annoyingly slow.
>
>

Submitting batch jobs

 

Memory limits

Line: 76 to 45
  Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use.
Deleted:
<
<

Submitting batch jobs

 

Do you want 10,000 e-mails?

Changed:
<
<
By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable.

Please place the following line in your condor submit file:

>
>
By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Therefore, the following has been made the default at Nevis:
 
Notification    = Error
Changed:
<
<
This means that condor will only send you an e-mail if there's an error while running the job.

Do you want to use up all your disk space?

At the end of most condor batch files, you'll see lines that look like this:

output   = mySimulation-$(Process).out
error    = mySimulation-$(Process).err
log      = mySimulation-$(Process).log

These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir points to a sub-directory of your home directory, sooner or later you'll fill up the /home partition on your server. Everyone in your group will be affected.

The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition which is several TB in size. It's a good idea to make sure your output files are written to this partition.

You can do this via one of the following:

  • submit your job from a directory on the /data partition; e.g., /a/data/<server>/<username>/
  • use the initialdir command to tell condor where inputs and outputs are located.
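
As an illustration of the second option, the lines at the end of a submit file could be arranged as below. The mySimulation sub-directory is a hypothetical name chosen for this sketch; whatever directory you pick has to exist before you submit, because condor will not create it for you.

```
# Sketch only: the mySimulation sub-directory is a hypothetical example.
# Keep the per-job files on the /data partition, not in /home.
initialdir = /a/data/<server>/<username>/mySimulation

output     = mySimulation-$(Process).out
error      = mySimulation-$(Process).err
log        = mySimulation-$(Process).log

queue 10000
```

With these lines the 30,000 files from a 10,000-job submission land on /data, so a runaway submission cannot fill the /home partition for the rest of your group.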
>
>
This means that condor will only send you an e-mail if there's an error while running the job. Don't override it!
 

Use the vanilla environment

 