Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 6 to 6 | ||||||||
Changed: | ||||||||
< < | This is a description of the batch job submission services available on the Linux cluster at Nevis Labs. | |||||||
> > | Stop. Read the page on disk sharing. | |||||||
Changed: | ||||||||
< < | Stop. Read this First. You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? How will the output files be transferred? Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy. | |||||||
> > | Yeah, it's a pain: you have to read both this page and a whole separate page on disk management. But once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy. | |||||||
Getting started | ||||||||
Line: 39 to 35 | ||||||||
As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. | ||||||||
Changed: | ||||||||
< < | Resource Management
Important: Think about how you transfer your job's files
Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via an automount path like /a/data/tanya; this means you're using NFS. Let's say there are 300 batch queues in the Nevis condor system. That means that 300 jobs are trying to access the disk on your server at once.
Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 300 files at once via NFS. It's happened several times at Nevis.
To partially solve this problem, the condor batch nodes are blocked from writing to the /home and /data partitions on the servers.
In order to get around this, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...
initialdir = ...where your inputs and outputs are located...
This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir.
If initialdir is located on machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:
initialdir = /a/data/kolya/jsmith
queue 10000
and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow | |||||||
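Put together, a submit file that uses these file-transfer commands might look like the sketch below. It is only an illustration, not a tested Nevis configuration: the executable name mySimulation.sh, the input file input.dat, and the job count are hypothetical placeholders, and the directory is the one from the example above.

```
# Minimal sketch of a condor submit file that uses file transfer.
# ASSUMPTIONS: mySimulation.sh and input.dat are hypothetical names;
# replace them with your own program and input files.
universe                = vanilla
executable              = mySimulation.sh
# Pass condor's process ID (0, 1, 2, ...) to the script as its argument.
arguments               = $(Process)

# Let condor copy files to and from the machine that runs the job,
# instead of having every job read and write over NFS.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat

# Inputs are taken from, and outputs returned to, this directory.
initialdir              = /a/data/kolya/jsmith

# One set of output, error, and log files per job.
output                  = mySimulation-$(Process).out
error                   = mySimulation-$(Process).err
log                     = mySimulation-$(Process).log

queue 10
```

Assuming the file is saved as mySimulation.cmd (another placeholder name), it would be submitted with condor_submit mySimulation.cmd. Each job then runs in a local scratch area on its batch node, and its output is copied back to initialdir by the submitting machine when the job exits.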
> > | Submitting batch jobs | |||||||
Memory limits | ||||||||
Line: 76 to 45 | ||||||||
Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use. | ||||||||
Deleted: | ||||||||
< < | Submitting batch jobs | |||||||
Do you want 10,000 e-mails? | ||||||||
Changed: | ||||||||
< < | By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Please place the following line in your condor submit file: | |||||||
> > | By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Therefore, the following has been made default at Nevis: | |||||||
Notification = Error | ||||||||
Changed: | ||||||||
< < | This means that condor will only send you an e-mail if there's an error while running the job.
Do you want to use up all your disk space?
At the end of most condor batch files, you'll see lines that look like this:
output = mySimulation-$(Process).out
error = mySimulation-$(Process).err
log = mySimulation-$(Process).log
These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir [...] /home partition on your server. Everyone in your group will be affected.
The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition which is several TB in size. It's a good idea to make sure your output files are written to this partition.
You can do this via one of the following:
| |||||||
> > | This means that condor will only send you an e-mail if there's an error while running the job. Don't override it! | |||||||
Use the vanilla environment |