Difference: Condor (23 vs. 24)

Revision 24 (2011-01-04) - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="LinuxCluster"

Batch Services at Nevis

Line: 10 to 10
  Stop. Read this First.
Changed:
<
<
You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? Where will the files go on output?
>
>
You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? How will the output files be transferred?
  Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy.
Line: 61 to 61
  Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir (see below).
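For example (transfer_input_files is a standard condor submit command; the input file name here is hypothetical), a relative path is resolved against initialdir, while a full pathname is used as-is:
initialdir           = /a/data/kolya/jsmith
transfer_input_files = myInput.dat
Here condor looks for /a/data/kolya/jsmith/myInput.dat, and output files named without a full pathname are written back to that same directory.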
Added:
>
>
There's one more thing to think about: If initialdir is located on machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:
initialdir = /a/data/kolya/jsmith
queue 10000
and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates, karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow for more information. This has not yet caused any machines at Nevis to crash, but it has caused both machines to become annoyingly slow.
 

Memory limits

The systems on the condor batch cluster have enough RAM for 1GB per processing queue. This means that if your job uses more than 1GB of memory, there can be a problem. For example, if your job required 2GB of memory, and a condor batch node had 16 queues, then 16 such jobs would require 32GB of RAM, twice as much as the machine has. The machine would start swapping memory pages continuously, and essentially halt.
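One way to avoid this, assuming your condor version is recent enough to support the request_memory submit command (the 2GB figure is just an illustration), is to declare the job's memory needs in the submit file:
request_memory = 2048
The value is in MB; with this line, condor will only match the job to a slot that advertises at least 2GB of memory, so it won't pack more such jobs onto a machine than its RAM can hold.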

Line: 90 to 97
 log = mySimulation-$(Process).log
Changed:
<
<
These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir points to a sub-directory of your home directory, sooner or later you'll fill up the /home partition on your server. That means everyone in your group will be affected.
>
>
These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir points to a sub-directory of your home directory, sooner or later you'll fill up the /home partition on your server. Everyone in your group will be affected.
 
Changed:
<
<
The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition.
>
>
The general solution is not to write your output files into your home directory. Every workgroup server has a /data partition, which is several TB in size. It's a good idea to make sure your output files are written to this partition.
You can do this in one of two ways:
Changed:
<
<
  • submitting your job from a directory on the /data partition; e.g., /a/data/<server>/<username>/
  • using the initialdir command to tell condor where inputs and outputs are located.
>
>
  • submit your job from a directory on the /data partition; e.g., /a/data/<server>/<username>/
  • use the initialdir command to tell condor where inputs and outputs are located, as in the sketch below.
 
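For example, a minimal submit file along these lines keeps all the per-job files on the /data partition (the executable name is a placeholder; substitute your own server and account name in the path):
universe   = vanilla
executable = mySimulation
initialdir = /a/data/<server>/<username>
output     = mySimulation-$(Process).out
error      = mySimulation-$(Process).err
log        = mySimulation-$(Process).log
queue 10
The output, error, and log files for each of the ten jobs then land in /a/data/<server>/<username> instead of your home directory.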

Use the vanilla environment

 