Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Condor Basics | ||||||||
Line: 8 to 8 | ||||||||
This page describes some of the basics for setting up jobs on the particle-physics computer cluster at Nevis. | ||||||||
Added: | ||||||||
> > | For a quick introduction to condor at Nevis, see the documents on batch processing and condor tutorial that are on WilliamSeligman's ROOT tutorial![]() | |||||||
Warning: Using condor is not trivial. You'll have to learn quite a few details, including how disk sharing works. What follows are a few basic concepts, but they are not enough on their own to get you started.
Documentation | ||||||||
Line: 51 to 53 | ||||||||
The standard condor examples | ||||||||
Changed: | ||||||||
< < | One good way to start is to copy the Condor examples: | |||||||
> > | One way to start is to copy the Condor examples: | |||||||
cp -arv /usr/share/doc/condor-*/examples .
cd examples
Changed: | ||||||||
< < | Read the README file; type make to compile the programs; type sh submit to submit a few test jobs. | |||||||
> > | Read the README file; type make to compile the programs; type sh submit to submit a few test jobs. Note that these examples are several years old, and you may have to do some debugging to get the compilation process to work. | |||||||
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe; see batch details. |
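For reference, here is a minimal sketch of what resubmitting sh_loop in the vanilla universe might look like; the submit-file name and output names below are illustrative only, and the details are covered on the batch details page.
# sh_loop.cmd -- hedged sketch; adjust names to match your copy of the examples
universe   = vanilla
executable = sh_loop
output     = sh_loop.out
error      = sh_loop.err
log        = sh_loop.log
queue
Submit it with condor_submit sh_loop.cmd .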
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Condor Basics | ||||||||
Line: 26 to 26 | ||||||||
In theory, you can run a program directly without a script; most of the examples in /usr/share/doc/condor-*/examples do this. | ||||||||
Changed: | ||||||||
< < | In practice, programs in physics typically need scripts to organize a program's execution environment. A shell script would invoke the "setup" commands or scripts needed to run the program; if you have to type source my-experiment-setup-sh before running your program, you'd put that command in a shell script. | |||||||
> > | In practice, programs in physics typically need scripts to organize a program's execution environment. A shell script would invoke the module load commands or scripts needed to run the program; if you have to type source my-experiment-setup-sh before running your program, you'd put that command in a shell script. | |||||||
Don't forget to make the shell script executable; e.g., chmod +x myscript.sh . | ||||||||
Changed: | ||||||||
< < | Scripts can become "mini-programs" themselves, as in this example![]() | |||||||
> > | Scripts can become "mini-programs" themselves, as in this example![]() | |||||||
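For instance, a wrapper script along the following lines sets up the environment, turns its arguments into a unique output file name, and then runs the program. This is only a sketch: the setup script, program name, and options are placeholders, not anything installed at Nevis.
#!/bin/sh
# myscript.sh -- hypothetical wrapper: argument 1 = particle ID, 2 = energy, 3 = random-number seed
source my-experiment-setup-sh                        # whatever setup your experiment requires
PID=$1
ENERGY=$2
SEED=$3
OUTPUT="sim_pid${PID}_E${ENERGY}_seed${SEED}.root"   # a unique name for the output file
./mySimulation --pid "$PID" --energy "$ENERGY" --seed "$SEED" --output "$OUTPUT"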
Command file |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Condor Basics | ||||||||
Line: 12 to 12 | ||||||||
Documentation | ||||||||
Changed: | ||||||||
< < | Condor![]() ![]() | |||||||
> > | Condor![]() ![]() | |||||||
Steps |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Changed: | ||||||||
< < | Condor Basic | |||||||
> > | Condor Basics | |||||||
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Condor Basic | ||||||||
Line: 30 to 30 | ||||||||
Don't forget to make the shell script executable; e.g., chmod +x myscript.sh . | ||||||||
Changed: | ||||||||
< < | Scripts can become "mini-programs" themselves, as in this example. For example, in a simulation script I wrote, the shell script determined the particle ID to be input into the Monte Carlo, the energy of that particle, and the random number seeds, and it determined a unique name for the output file.
> > | Scripts can become "mini-programs" themselves, as in this example![]() | |||||||
Command file |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Changed: | ||||||||
< < | Batch Services at Nevis | |||||||
> > | Condor Basic | |||||||
Changed: | ||||||||
< < | Stop. Read the page on disk sharing. | |||||||
> > | This page describes some of the basics for setting up jobs on the particle-physics computer cluster at Nevis. | |||||||
Changed: | ||||||||
< < | Yeah, it's a pain: you have to read both this page and a whole separate page on disk management. But once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy. | |||||||
> > | Warning: Using condor is not trivial. You'll have to learn quite a few details and about disk sharing. What follows are a few basic concepts, but it is not enough to get you started. | |||||||
Changed: | ||||||||
< < | Getting started | |||||||
> > | Documentation | |||||||
Changed: | ||||||||
< < | The batch job submission system we're using at Nevis is Condor![]() ![]() | |||||||
> > | Condor![]() ![]() | |||||||
Changed: | ||||||||
< < | The standard condor examplesIf you're just starting to learn Condor, a good way to start is to copy the Condor examples:cp -arv /usr/share/doc/condor-*/examples . cd examplesRead the README file; type make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.
Examples that incorporate the tips on this pageMany of the following tips have been combined into a set of example scripts. The Athena-related scripts are in~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. Note that those examples were prepared in 2005.
Submitting multiple jobs with one
An ATLAS example: Running Multiple Jobs On Condor | |||||||
> > | Steps | |||||||
Changed: | ||||||||
< < | Submitting batch jobs | |||||||
> > | There are usually three steps to developing a program to submit to condor. | |||||||
Changed: | ||||||||
< < | Memory limits | |||||||
> > | Program | |||||||
Changed: | ||||||||
< < | The systems on the condor batch cluster have enough RAM for 1GB/processing queue. This means if your job uses more than 1GB of memory, there can be a problem. For example, if your job required 2GB of memory, and a condor batch node had 16 queues, then your 16 jobs will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt. | |||||||
> > | This is the code that you want condor to execute. I assume you've developed the program interactively, but you now want to automate its execution for condor. You probably don't have to change the program itself, but you may have to move the executable and any libraries to a disk that's visible to the condor batch system. | |||||||
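As a rough illustration of that last point (the server name kolya and the file names here are placeholders, not a recommendation), you might stage the files on a /data area like this:
# copy the program and a hypothetical support library to a disk the batch nodes can read
mkdir -p /a/data/kolya/$USER/bin
cp mySimulation libMyTables.so /a/data/kolya/$USER/bin/
chmod +x /a/data/kolya/$USER/bin/mySimulation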
Changed: | ||||||||
< < | To keep this from happening, condor will automatically cancel a job that requires more than 1GB of RAM. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, it will tend to overestimate; if a program uses shared libraries, it tends to underestimate. | |||||||
> > | Script | |||||||
Changed: | ||||||||
< < | Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use. | |||||||
> > | In theory, you can run a program directly without a script; most of the examples in /usr/share/doc/condor-*/examples do this. | |||||||
Changed: | ||||||||
< < | Do you want 10,000 e-mails? | |||||||
> > | In practice, programs in physics typically need scripts to organize a program's execution environment. A shell script would invoke the "setup" commands or scripts needed to run the program; if you have to type source my-experiment-setup-sh before running your program, you'd put that command in a shell script. | |||||||
Changed: | ||||||||
< < | By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Therefore, the following has been made default at Nevis: | |||||||
> > | Don't forget to make the shell script executable; e.g., chmod +x myscript.sh . | |||||||
Changed: | ||||||||
< < | Notification = Error | |||||||
< < | Scripts can become "mini-programs" themselves, as in this example. For example, in a simulation script I wrote, the shell script determined the particle ID to be input into the Monte Carlo, the energy of that particle, and the random number seeds, and it determined a unique name for the output file.
Changed: | ||||||||
< < | This means that condor will only send you an e-mail if there's an error while running the job. Don't override it! | |||||||
> > | Command file | |||||||
Changed: | ||||||||
< < | Use the vanilla environment | |||||||
> > | Condor requires that jobs be submitted via a condor command file; e.g., condor_submit mycommands.cmd . This command file tells condor the script to execute, what files to copy, and where to put the program's output. The command file also tells condor how many copies of the program to run; that's how you submit 1000 jobs with a single command. | |||||||
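To make the pieces concrete, here is a minimal sketch of such a command file, assembled from directives that appear elsewhere on this page; the file names, the initialdir path, and the number of jobs are illustrative only.
# mycommands.cmd -- hedged sketch of a submit file
universe                = vanilla
executable              = myscript.sh
arguments               = $(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = mySimulation
initialdir              = /a/data/kolya/jsmith
notification            = Error
output                  = mySimulation-$(Process).out
error                   = mySimulation-$(Process).err
log                     = mySimulation-$(Process).log
queue 10
Submit it with condor_submit mycommands.cmd ; the queue 10 line is what runs ten copies, each with a different $(Process) value.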
Changed: | ||||||||
< < | Unless you've specifically used the condor_compile![]() | |||||||
> > | Batch Clusters | |||||||
Changed: | ||||||||
< < | universe = vanilla | |||||||
> > | There is more than one separate particle-physics batch cluster at Nevis, due to the different analysis requirements of some groups:
| |||||||
Changed: | ||||||||
< < | condor log files | |||||||
> > | The cluster which executes a job is determined by the machine on which you issue the condor_submit command. For example, if you submit a job from a Neutrino system, it will run on the Neutrino cluster; if you submit a job from kolya or karthur , it runs on the general cluster; if you submit a job from xenia , it runs on the ATLAS cluster. | |||||||
Changed: | ||||||||
< < | If you want to see the condor daemons' log files for a machine with the name hostname , look in /a/data/<hostname>/condor/log . For example:
# ls -blrth /a/data/karthur/condor/log -rw-r--r-- 1 condor condor 153 2010-04-13 15:07 StarterLog -rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog -rw-r--r-- 1 root root 591K 2010-04-13 16:29 MasterLog -rw-r--r-- 1 root root 788K 2010-04-13 17:15 StartLog -rw-r--r-- 1 root root 562K 2010-04-13 17:25 NegotiatorLog -rw-r--r-- 1 root root 296K 2010-04-13 17:25 CollectorLog About the batch clusterBatch managerThe system responsible for administering batch services iscondor.nevis.columbia.edu . Users typically do not log in to this machine directly; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
Condor status and usageYou can see how much of the batch cluster is in use, and by whom:
Fair useThe condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work. As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme![]() ![]()
condor_userprio -allusersThe larger the number, the lower your priority in comparison to the other users listed. | |||||||
> > | Where to learn | |||||||
Changed: | ||||||||
< < | What processing power is available | |||||||
> > | Let's start with the obvious: If someone else in your group has a set of condor scripts that work at Nevis, copy them! If you have to write your own: | |||||||
Changed: | ||||||||
< < | The following commands will show you the machines available to run your jobs, their status, and their resources:
condor_status condor_status -serverObviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:
![]() Rank = MipsWith all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file ![]() Requirements = (Mips > 2000)This would restrict your job to the fastest processors on the cluster. All the machines on the batch farm are not the sameThe batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools![]() "Why isn't my job running on all the machines in the batch farm?"There may be several reasons:The heterogenous clusterNot all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler![]() # If you're using bash: shopt -s expand_aliases source /usr/nevis/adm/nevis-init.sh setup root geant4Finally, don't forget to set initialdir in your condor submit file.
The job requirementsThere may be something explicit or implicit in the resources required to run your job. To pick an unrealistic example, if you job requiresksh and that shell isn't installed on machine, then it won't execute on the cluster. A more practical example: If you have the following in your job submit file:
Requirements = ( Memory > 1024 )then your job won't execute if the amount of memory per job queue is 1024 or less, including those machines with 1023 MB per queue to due rounding in the memory calculation. If you think your job with ID 4402 should be able to execute on machine batch04 , you can compare what condor thinks are the job's requirements against what the machine offers:
condor_q -long -global 4402 condor_status -long batch04 | |||||||
> > | The standard condor examples | |||||||
Changed: | ||||||||
< < | Another clue![]() condor_q . If you have a job held with an ID of 44.20:
condor_q -analyze 44.20 | |||||||
> > | One good way to start is to copy the Condor examples:
cp -arv /usr/share/doc/condor-*/examples . cd examples | |||||||
Changed: | ||||||||
< < | Suspended jobs | |||||||
> > | Read the README file; type make to compile the programs; type sh submit to submit a few test jobs. | |||||||
Changed: | ||||||||
< < | As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait. | |||||||
> > | You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe; see batch details. | |||||||
Changed: | ||||||||
< < | Extra disk space | |||||||
> > | Other programs in the examples may not work either. Look at the output, error, and log files; search the web for any error messages. This will provide experience when your "real" jobs begin to fail. | |||||||
Changed: | ||||||||
< < | In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB. | |||||||
> > | Some practical examples | |||||||
Changed: | ||||||||
< < | The names of these RAID arrays are: | |||||||
> > | Many of the details have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. Note that these examples were prepared in 2005, before we figured out how to do disk sharing properly. | |||||||
Changed: | ||||||||
< < |
| |||||||
> > | Submitting multiple jobs with one | |||||||
Changed: | ||||||||
< < | For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group): | |||||||
> > | An ATLAS example: Running Multiple Jobs On Condor![]() | |||||||
Changed: | ||||||||
< < | cd /a/data/condor/array2/atlas/ mkdir $user cd $user # ... create whatever files you want | |||||||
> > | As of Jun-2008, you can find several examples of multiple job submission in /a/home/houston/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. Again, these examples were written before we figured out how to do disk sharing.
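The idea, in a hedged sketch (the arithmetic and names below are invented for illustration, not taken from those directories): the command file passes $(Process) to the script, and the script turns it into job-specific numeric and text parameters.
In the command file:
arguments = $(Process)
queue 100
In the shell script:
#!/bin/sh
PROCESS=$1
ENERGY=$(( 100 + 50 * PROCESS ))    # numeric parameter derived from the process ID
LABEL="run${PROCESS}"               # text parameter derived from the process ID
./mySimulation --energy "$ENERGY" --label "$LABEL" --output "${LABEL}.root"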
Changed: | ||||||||
< < | Important! If you're skimming this page, stop and read the following paragraph! | |||||||
> > | What's next | |||||||
Changed: | ||||||||
< < | The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5![]() | |||||||
> > | Now look at the pages on disk sharing and batch details. This will help you create scripts and command files that work in the current Nevis environment. Once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy. | |||||||
Deleted: | ||||||||
< < | One more time: the disks on these partitions are not backed up! | |||||||
\ No newline at end of file | ||||||||
Added: | ||||||||
> > | Good luck! |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 33 to 33 | ||||||||
An ATLAS example: Running Multiple Jobs On Condor![]() | ||||||||
Changed: | ||||||||
< < | As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. | |||||||
> > | As of Jun-2008, you can find several examples of multiple job submission in /a/home/houston/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. | |||||||
Submitting batch jobs |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 6 to 6 | ||||||||
Changed: | ||||||||
< < | This is a description of the batch job submission services available on the Linux cluster at Nevis Labs![]() | |||||||
> > | Stop. Read the page on disk sharing. | |||||||
Changed: | ||||||||
< < | Stop. Read this First. You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? How will the output files be transferred? Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy. | |||||||
> > | Yeah, it's a pain: you have to read both this page and a whole separate page on disk management. But once you understand the concept of organizing the resources for your job, the rest of condor is relatively easy. | |||||||
Getting started | ||||||||
Line: 39 to 35 | ||||||||
As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. | ||||||||
Changed: | ||||||||
< < | Resource ManagementImportant: Think about how you transfer your job's filesPicture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via an automount path like/a/data/tanya ; this means you're using NFS. Let's say there are 300 batch queues in the Nevis condor system. That means that 300 jobs are trying to access the disk on your server at once.
Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 300 files at once via NFS. It's happened several times at Nevis.
To partially solve this problem, the condor batch nodes are blocked from writing to the /home and /data partitions on the servers.
In order to get around this, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual![]() should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = ...list of input files... initialdir = ...where your inputs and outputs are located...This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir ![]() ![]() initialdir is located machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:
initialdir = /a/data/kolya/jsmith queue 10000and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow![]() | |||||||
> > | Submitting batch jobs | |||||||
Memory limits | ||||||||
Line: 76 to 45 | ||||||||
Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use. | ||||||||
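If you suspect this is happening, one hedged way to check is to look at the memory-related attributes condor records for the job; using the job ID 4402 that appears elsewhere on this page (attribute names can vary between condor versions, but ImageSize is reported in KB):
condor_q -long 4402 | grep -i imagesize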
Deleted: | ||||||||
< < | Submitting batch jobs | |||||||
Do you want 10,000 e-mails? | ||||||||
Changed: | ||||||||
< < | By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Please place the following line in your condor submit file: | |||||||
> > | By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Therefore, the following has been made default at Nevis: | |||||||
Notification = Error | ||||||||
Changed: | ||||||||
< < | This means that condor will only send you an e-mail if there's an error while running the job.
Do you want to use up all your disk space?At the end of most condor batch files, you'll see lines that look like this:output = mySimulation-$(Process).out error = mySimulation-$(Process).err log = mySimulation-$(Process).logThese lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir ![]() /home partition on your server. Everyone in your group will be affected.
The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition which is several TB in size. It's a good idea to make sure your output files are written to this partition.
You can do this via one of the following:
| |||||||
> > | This means that condor will only send you an e-mail if there's an error while running the job. Don't override it! | |||||||
Use the vanilla environment |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 132 to 132 | ||||||||
Condor status and usage | ||||||||
Changed: | ||||||||
< < | The CondorView![]() | |||||||
> > | You can see how much of the batch cluster is in use, and by whom:
| |||||||
Fair use |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 61 to 61 | ||||||||
Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir![]() | ||||||||
Changed: | ||||||||
< < | There's one more thing to think about: If initialdir is located machine A (via automount), and you submit the job from machine B, then condor will use machine B will be used to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file: | |||||||
> > | There's one more thing to think about: If initialdir is located machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file: | |||||||
initialdir = /a/data/kolya/jsmith
queue 10000
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 10 to 10 | ||||||||
Stop. Read this First. | ||||||||
Changed: | ||||||||
< < | You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? Where will the files go on output? | |||||||
> > | You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? How will the output files be transferred? | |||||||
Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy. | ||||||||
Line: 61 to 61 | ||||||||
Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir![]() | ||||||||
Added: | ||||||||
> > | There's one more thing to think about: If initialdir is located machine A (via automount), and you submit the job from machine B, then condor will use machine B will be used to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:
initialdir = /a/data/kolya/jsmith
queue 10000
and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow.
Memory limits
The systems on the condor batch cluster have enough RAM for 1GB/processing queue. This means if your job uses more than 1GB of memory, there can be a problem. For example, if your job requires 2GB of memory, and a condor batch node has 16 queues, then your 16 jobs will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt.
Line: 90 to 97 | ||||||||
log = mySimulation-$(Process).log | ||||||||
Changed: | ||||||||
< < | These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir![]() /home partition on your server. That means everyone in your group will be affected. | |||||||
> > | These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir![]() /home partition on your server. Everyone in your group will be affected. | |||||||
Changed: | ||||||||
< < | The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition. | |||||||
> > | The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition which is several TB in size. It's a good idea to make sure your output files are written to this partition. | |||||||
You can do this via one of the following: | ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
Use the vanilla environment |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 90 to 90 | ||||||||
log = mySimulation-$(Process).log | ||||||||
Changed: | ||||||||
< < | These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected. | |||||||
> > | These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir![]() /home partition on your server. That means everyone in your group will be affected. | |||||||
The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition. | ||||||||
Changed: | ||||||||
< < | You can do this by:
| |||||||
> > | You can do this via one of the following:
| |||||||
| ||||||||
Deleted: | ||||||||
< < | Don't forget to create a directory with a name like /a/data/<server>/<username>/ before you submit your first job. | |||||||
Use the vanilla environmentUnless you've specifically used the condor_compile![]() |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 12 to 12 | ||||||||
You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? Where will the files go on output? | ||||||||
Changed: | ||||||||
< < | Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy. | |||||||
> > | Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy. | |||||||
Getting started | ||||||||
Line: 39 to 39 | ||||||||
As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. | ||||||||
Added: | ||||||||
> > | Resource ManagementImportant: Think about how you transfer your job's filesPicture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via an automount path like/a/data/tanya ; this means you're using NFS. Let's say there are 300 batch queues in the Nevis condor system. That means that 300 jobs are trying to access the disk on your server at once.
Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 300 files at once via NFS. It's happened several times at Nevis.
To partially solve this problem, the condor batch nodes are blocked from writing to the /home and /data partitions on the servers.
In order to get around this, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ...list of input files...
initialdir = ...where your inputs and outputs are located...
This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir directory.
Memory limits
The systems on the condor batch cluster have enough RAM for 1GB/processing queue. This means if your job uses more than 1GB of memory, there can be a problem. For example, if your job required 2GB of memory, and a condor batch node had 16 queues, then your 16 jobs will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt.
To keep this from happening, condor will automatically cancel a job that requires more than 1GB of RAM. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, it will tend to overestimate; if a program uses shared libraries, it tends to underestimate.
Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use.
Submitting batch jobs
Do you want 10,000 e-mails?
By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Please place the following line in your condor submit file:
Notification = Error
This means that condor will only send you an e-mail if there's an error while running the job.
Do you want to use up all your disk space?
At the end of most condor batch files, you'll see lines that look like this:
output = mySimulation-$(Process).out
error = mySimulation-$(Process).err
log = mySimulation-$(Process).log
These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected.
The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition.
You can do this by:
/a/data/<server>/<username>/ before you submit your first job.
Use the vanilla environment
Unless you've specifically used the condor_compile command, include the following line in your condor submit file:
universe = vanilla
condor log files
If you want to see the condor daemons' log files for a machine with the name hostname , look in /a/data/<hostname>/condor/log . For example:
# ls -blrth /a/data/karthur/condor/log -rw-r--r-- 1 condor condor 153 2010-04-13 15:07 StarterLog -rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog -rw-r--r-- 1 root root 591K 2010-04-13 16:29 MasterLog -rw-r--r-- 1 root root 788K 2010-04-13 17:15 StartLog -rw-r--r-- 1 root root 562K 2010-04-13 17:25 NegotiatorLog -rw-r--r-- 1 root root 296K 2010-04-13 17:25 CollectorLog | |||||||
About the batch clusterBatch manager | ||||||||
Line: 154 to 235 | ||||||||
The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5![]() | ||||||||
Deleted: | ||||||||
< < |
Submitting batch jobsDo you want 10,000 e-mails?By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Please place the following line in your condor submit file:Notification = ErrorThis means that condor will only send you an e-mail if there's an error while running the job. Do you want to use up all your disk space?At the end of most condor batch files, you'll see lines that look like this:output = mySimulation-$(Process).out error = mySimulation-$(Process).err log = mySimulation-$(Process).logThese lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected. The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition.
You can do this by:
output = /a/data/<server>/<username>/mySimulation--$(Process).out error = /a/data/<server>/<username>/mySimulation--$(Process).err log = /a/data/<server>/<username>/mySimulation-$(Process).logDon't forget to create /a/data/<server>/<username>/ before you submit your first job.
Use the vanilla environmentUnless you've specifically used the condor_compile![]() universe = vanilla Handling disk filesAs you read through the User's Manual![]() ![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory:
initialdir = /a/data/tanya/seligman/kit/TestArea/ Important: Think about how you transfer your job's filesPicture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once. Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 200 files at once via NFS. It's happened several times at Nevis. To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual![]() should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = ...list of input files...This will transfer your input files to the condor master server once, instead of 200 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the directory from which you submitted the job. Result: you don't crash your server. You also don't clog up the network with unnecessary file transfers. sh-style shells versus csh-style shellsWe have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: There appears to be a difference in the way the sh![]() ![]() sh , bash , or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh , the scripts will fail due to "file not found" errors unless you do one of the following:
should_transfer_files = YES when_to_transfer_output = ON_EXIT
executable = athena.cshbut the following succeeds: executable = /a/data/tanya/seligman/kit/TestArea/athena.csh condor log filesIf you want to see the condor daemons' log files for a machine with the namehostname , look in /a/data/<hostname>/condor/log . For example:
# ls -blrth /a/data/karthur/condor/log -rw-r--r-- 1 condor condor 153 2010-04-13 15:07 StarterLog -rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog -rw-r--r-- 1 root root 591K 2010-04-13 16:29 MasterLog -rw-r--r-- 1 root root 788K 2010-04-13 17:15 StartLog -rw-r--r-- 1 root root 562K 2010-04-13 17:25 NegotiatorLog -rw-r--r-- 1 root root 296K 2010-04-13 17:25 CollectorLog | |||||||
\ No newline at end of file |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 8 to 8 | ||||||||
This is a description of the batch job submission services available on the Linux cluster at Nevis Labs![]() | ||||||||
Added: | ||||||||
> > | Stop. Read this First. You have a program and perhaps a script. You just want to submit it and start thinking about physics again. But before you use condor, you have to think about resource management: How does condor know which files you'll need for input? Where will the files go on output? Although the section on Resource Management is in the middle of this page, where it fits logically, it's the most important aspect of your condor job. Once you understand the concepts, the rest of condor is relatively easy. | |||||||
Getting startedThe batch job submission system we're using at Nevis is Condor![]() ![]() |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Changed: | ||||||||
< < | This is a description of the batch job submission services available on the Linux cluster at Nevis Labs![]() | |||||||
> > | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | This is a description of the batch job submission services available on the Linux cluster at Nevis Labs![]() | |||||||
Getting started |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 8 to 8 | ||||||||
Getting started | ||||||||
Added: | ||||||||
> > | The batch job submission system we're using at Nevis is Condor![]() ![]() | |||||||
The standard condor examplesIf you're just starting to learn Condor, a good way to start is to copy the Condor examples: | ||||||||
Line: 147 to 149 | ||||||||
Submitting batch jobs | ||||||||
Deleted: | ||||||||
< < | The batch job submission system we're using at Nevis is Condor![]() ![]() | |||||||
Do you want 10,000 e-mails?By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 163 to 163 | ||||||||
At the end of most condor batch files, you'll see lines that look like this: | ||||||||
Changed: | ||||||||
< < | output = mySimulation--$(Process).out error = mySimulation--$(Process).err | |||||||
> > | output = mySimulation-$(Process).out error = mySimulation-$(Process).err | |||||||
log = mySimulation-$(Process).log These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected. | ||||||||
Changed: | ||||||||
< < | The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition. | |||||||
> > | The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition. | |||||||
You can do this by:
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 159 to 159 | ||||||||
This means that condor will only send you an e-mail if there's an error while running the job. | ||||||||
Added: | ||||||||
> > | Do you want to use up all your disk space?
At the end of most condor batch files, you'll see lines that look like this:
output = mySimulation--$(Process).out
error = mySimulation--$(Process).err
log = mySimulation-$(Process).log
These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. Sooner or later you'll fill up your home directory. Since you share the home directory on your server with everyone else in your working group, that means everyone in your group will be affected.
The general solution is to not write your output files into your home directory. Every workgroup server has a /data partition, which is normally several TB in size. It's a good idea to make sure your output files are written to this partition.
You can do this by:
output = /a/data/<server>/<username>/mySimulation--$(Process).out
error = /a/data/<server>/<username>/mySimulation--$(Process).err
log = /a/data/<server>/<username>/mySimulation-$(Process).log
Don't forget to create /a/data/<server>/<username>/ before you submit your first job.
Use the vanilla environmentUnless you've specifically used the condor_compile![]() |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 6 to 6 | ||||||||
Added: | ||||||||
> > | Getting started
The standard condor examples
If you're just starting to learn Condor, a good way to start is to copy the Condor examples:
cp -arv /usr/share/doc/condor-*/examples .
cd examples
Read the README file; type make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.
Examples that incorporate the tips on this page
Many of the following tips have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. Note that those examples were prepared in 2005.
Submitting multiple jobs with one
An ATLAS example: Running Multiple Jobs On Condor | |||||||
About the batch clusterBatch manager | ||||||||
Line: 184 to 207 | ||||||||
condor log files | ||||||||
Changed: | ||||||||
< < | If you want to see the condor daemons' log files for a machine with the name hostname , look in /a/data/hostname/condor/log . For example, to find out the "real" name of the current condor master server:
# host condor.nevis.columbia.edu condor.nevis.columbia.edu is an alias for karthur.nevis.columbia.edu.Then you can look at its log files: | |||||||
> > | If you want to see the condor daemons' log files for a machine with the name hostname , look in /a/data/<hostname>/condor/log . For example: | |||||||
# ls -blrth /a/data/karthur/condor/log -rw-r--r-- 1 condor condor 153 2010-04-13 15:07 StarterLog -rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog | ||||||||
Line: 197 to 217 | ||||||||
-rw-r--r-- 1 root root 562K 2010-04-13 17:25 NegotiatorLog -rw-r--r-- 1 root root 296K 2010-04-13 17:25 CollectorLog | ||||||||
Deleted: | ||||||||
< < |
ExamplesThe standard condor examplesIf you're just starting to learn Condor, a good way to start is to copy the Condor examples:cp -arv /usr/share/doc/condor-*/examples . cd examplesRead the README file; type make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.
Examples that incorporate the tips on this pageMany of the above tips, and others, have been combined into a set of example scripts. The Athena-related scripts are in~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. Note that those examples were prepared in 2005.
Submitting multiple jobs with one
An ATLAS example: Running Multiple Jobs On Condor | |||||||
\ No newline at end of file |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 102 to 102 | ||||||||
Extra disk space | ||||||||
Changed: | ||||||||
< < | In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount![]() | |||||||
> > | In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB. | |||||||
The names of these RAID arrays are: | ||||||||
Line: 144 to 144 | ||||||||
Handling disk files | ||||||||
Changed: | ||||||||
< < | As you read through the User's Manual![]() ![]() ![]() | |||||||
> > | As you read through the User's Manual![]() ![]() | |||||||
Changed: | ||||||||
< < | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
> > | Because we use a shared file system![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
initialdir = /a/data/tanya/seligman/kit/TestArea/ |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 10 to 10 | ||||||||
Batch manager | ||||||||
Changed: | ||||||||
< < | The system responsible for administering batch services is condor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
To use any of the condor commands given below, you have to set it up:
setup condor | |||||||
> > | The system responsible for administering batch services is condor.nevis.columbia.edu . Users typically do not log in to this machine directly; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you. | |||||||
Condor status and usage | ||||||||
Line: 24 to 20 | ||||||||
The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work. | ||||||||
Changed: | ||||||||
< < | As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme![]() ![]() | |||||||
> > | As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme![]() ![]() | |||||||
The practical upshots of condor's default priority scheme:
| ||||||||
Line: 53 to 49 | ||||||||
| ||||||||
Changed: | ||||||||
< < | The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank![]() | |||||||
> > | The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank![]() | |||||||
Rank = Mips | ||||||||
Changed: | ||||||||
< < | With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file![]() | |||||||
> > | With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file![]() | |||||||
Requirements = (Mips > 2000)This would restrict your job to the fastest processors on the cluster. | ||||||||
Line: 95 to 91 | ||||||||
condor_status -long batch04 | ||||||||
Changed: | ||||||||
< < | Another clue![]() condor_q . If you have a job held with an ID of 44.20: | |||||||
> > | Another clue![]() condor_q . If you have a job held with an ID of 44.20: | |||||||
condor_q -analyze 44.20 | ||||||||
Line: 128 to 124 | ||||||||
Submitting batch jobs | ||||||||
Changed: | ||||||||
< < | The batch job submission system we're using at Nevis is Condor![]() ![]() ![]() setup condorThis will set the variable $CONDOR_CONFIG to ~condor/etc/condor_config , and add ~condor/bin to your $PATH . | |||||||
> > | The batch job submission system we're using at Nevis is Condor![]() ![]() | |||||||
Do you want 10,000 e-mails? | ||||||||
Line: 148 to 138 | ||||||||
Use the vanilla environment | ||||||||
Changed: | ||||||||
< < | Unless you've specifically used the condor_compile![]() | |||||||
> > | Unless you've specifically used the condor_compile![]() | |||||||
universe = vanilla Handling disk files | ||||||||
Changed: | ||||||||
< < | As you read through the User's Manual![]() ![]() ![]() | |||||||
> > | As you read through the User's Manual![]() ![]() ![]() | |||||||
Changed: | ||||||||
< < | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
> > | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
initialdir = /a/data/tanya/seligman/kit/TestArea/ | ||||||||
Line: 166 to 156 | ||||||||
Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 200 files at once via NFS. It's happened several times at Nevis. | ||||||||
Changed: | ||||||||
< < | To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual![]() | |||||||
> > | To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual![]() | |||||||
should_transfer_files = YES when_to_transfer_output = ON_EXIT | ||||||||
Line: 192 to 182 | ||||||||
but the following succeeds:
executable = /a/data/tanya/seligman/kit/TestArea/athena.csh | ||||||||
Added: | ||||||||
> > | condor log files
If you want to see the condor daemons' log files for a machine with the name hostname , look in /a/data/hostname/condor/log . For example, to find out the "real" name of the current condor master server:
# host condor.nevis.columbia.edu
condor.nevis.columbia.edu is an alias for karthur.nevis.columbia.edu.
Then you can look at its log files:
# ls -blrth /a/data/karthur/condor/log
-rw-r--r-- 1 condor condor 153 2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root root 591K 2010-04-13 16:29 MasterLog
-rw-r--r-- 1 root root 788K 2010-04-13 17:15 StartLog
-rw-r--r-- 1 root root 562K 2010-04-13 17:25 NegotiatorLog
-rw-r--r-- 1 root root 296K 2010-04-13 17:25 CollectorLog
Examples
The standard condor examples
If you're just starting to learn Condor, a good way to start is to copy the Condor examples: | ||||||||
Changed: | ||||||||
< < | cp -arv ~condor/condor-7.2.4/examples . | |||||||
> > | cp -arv /usr/share/doc/condor-*/examples . | |||||||
cd examples |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 95 to 95 | ||||||||
condor_status -long batch04 | ||||||||
Added: | ||||||||
> > | Another clue![]() condor_q . If you have a job held with an ID of 44.20:
condor_q -analyze 44.20 | |||||||
Suspended jobs
As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 89 to 89 | ||||||||
then your job won't execute if the amount of memory per job queue is 1024 or less, including those machines with 1023 MB per queue due to rounding in the memory calculation. | ||||||||
Changed: | ||||||||
< < | If you think your job with ID 4402 should be able to execute on machine batch04 , you can compare what condor thinks are the jobs requirements against what the machine offers: | |||||||
> > | If you think your job with ID 4402 should be able to execute on machine batch04 , you can compare what condor thinks are the job's requirements against what the machine offers: | |||||||
condor_q -long -global 4402
condor_status -long batch04 |
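The -long output of both commands can run to hundreds of ClassAd attributes. One way to narrow the comparison to the attributes that usually matter for matching (the requirements expression, memory, and speed) is to filter with grep; the job ID and machine name are the ones from the example above, and the attribute names may vary slightly between condor versions:
condor_q -long -global 4402 | grep -i -E 'Requirements|Memory|ImageSize'
condor_status -long batch04 | grep -i -E 'Memory|Mips'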
Line: 143 to 143 | ||||||||
Use the vanilla environment | ||||||||
Deleted: | ||||||||
< < | We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: | |||||||
Unless you've specifically used the condor_compile![]() command to link your programs with the Condor libraries (and if you did, you'd know it) you must include the line
universe = vanilla | ||||||||
Line: 176 to 174 | ||||||||
sh-style shells versus csh-style shells | ||||||||
Added: | ||||||||
> > | We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: | |||||||
There appears to be a difference in the way the sh![]() ![]() sh , bash , or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh , the scripts will fail due to "file not found" errors unless you do one of the following:
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 12 to 12 | ||||||||
The system responsible for administering batch services is condor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you. | ||||||||
Added: | ||||||||
> > | To use any of the condor commands given below, you have to set it up:
setup condor
Condor status and usage
The CondorView![]() program gives a graphical overview of the usage of the batch farm, including which groups have been using it the most. | |||||||
Fair use
The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work. | ||||||||
Line: 24 to 32 | ||||||||
If you use the vanilla environment (see below), as most users at Nevis must, for a job to be "pre-empted" means that it is suspended until that same machine has an available queue. | ||||||||
Added: | ||||||||
> > | To get an idea of your user resource consumption and how it compares to other users, use these commands:
condor_userprio -allusers
The larger the number, the lower your priority in comparison to the other users listed. | |||||||
What processing power is available
The following commands will show you the machines available to run your jobs, their status, and their resources: | ||||||||
Line: 53 to 67 | ||||||||
"Why isn't my job running on all the machines in the batch farm?" | ||||||||
Changed: | ||||||||
< < | You didn't read the previous section, did you? | |||||||
> > | There may be several reasons:
The heterogeneous cluster | |||||||
Changed: | ||||||||
< < | Here it is again: Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler![]() | |||||||
> > | Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler![]() | |||||||
You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:
# If you're using bash: | ||||||||
Line: 65 to 81 | ||||||||
Finally, don't forget to set initialdir in your condor submit file. | ||||||||
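Putting these pieces together, a job at Nevis typically ends up as a small wrapper script that the submit file points at. The sketch below only illustrates the pattern: run_myjob.sh, myanalysis, and the input/output file names are placeholders, while the environment-setup lines are the ones quoted above:
#!/bin/bash
# run_myjob.sh -- hypothetical wrapper script for a condor job.
# Set up the standard Nevis environment on whichever machine runs the job...
shopt -s expand_aliases
source /usr/nevis/adm/nevis-init.sh
setup root geant4
# ...then run the actual program (placeholder name and arguments).
./myanalysis input.root output.root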
Added: | ||||||||
> > | The job requirements
There may be something explicit or implicit in the resources required to run your job. To pick an unrealistic example, if your job requires ksh and that shell isn't installed on a machine, then it won't execute on that machine. A more practical example: If you have the following in your job submit file:
Requirements = ( Memory > 1024)
then your job won't execute if the amount of memory per job queue is 1024 or less, including those machines with 1023 MB per queue due to rounding in the memory calculation. If you think your job with ID 4402 should be able to execute on machine batch04 , you can compare what condor thinks are the jobs requirements against what the machine offers:
condor_q -long -global 4402
condor_status -long batch04
Suspended jobs
As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait. |
Extra disk spaceIn addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount![]() | ||||||||
Line: 123 to 157 | ||||||||
initialdir = /a/data/tanya/seligman/kit/TestArea/ | ||||||||
Changed: | ||||||||
< < | Important: Think about how you handle your job's files | |||||||
> > | Important: Think about how you transfer your job's files | |||||||
Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 22 to 22 | ||||||||
| ||||||||
Changed: | ||||||||
< < | If you use the vanilla environment (see below), as most users at Nevis must, for a job to be "pre-empted" means that it is killed and will be re-started from the beginning when a machine becomes available. | |||||||
> > | If you use the vanilla environment (see below), as most users at Nevis must, for a job to be "pre-empted" means that it is suspended until that same machine has an available queue. | |||||||
What processing power is available | ||||||||
Line: 111 to 111 | ||||||||
We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: | ||||||||
Changed: | ||||||||
< < | Unless you've specifically used the condor_compile![]() | |||||||
> > | Unless you've specifically used the condor_compile![]() | |||||||
universe = vanilla |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 97 to 97 | ||||||||
This will set the variable $CONDOR_CONFIG to ~condor/etc/condor_config , and add ~condor/bin to your $PATH . | ||||||||
Added: | ||||||||
> > | Do you want 10,000 e-mails?
By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Please place the following line in your condor submit file:
Notification = Error
This means that condor will only send you an e-mail if there's an error while running the job. |
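If you want those error notifications to go somewhere other than your default address, condor's notify_user command can be added alongside the line above; the address here is only a placeholder:
Notification = Error
notify_user = your_login@nevis.columbia.edu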
Use the vanilla environmentWe have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Changed: | ||||||||
< < |
Batch Services at Nevis | |||||||
> > |
Batch Services at Nevis | |||||||
Changed: | ||||||||
< < | This is a description of the batch job submission services available on the Linux cluster![]() ![]() | |||||||
> > | This is a description of the batch job submission services available on the Linux cluster at Nevis Labs![]() | |||||||
Changed: | ||||||||
< < | Batch and disk services | |||||||
> > | About the batch cluster
Batch manager | |||||||
The system responsible for administering batch services is condor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you. | ||||||||
Added: | ||||||||
> > | Fair useThe condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work. As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme![]() ![]()
What processing power is availableThe following commands will show you the machines available to run your jobs, their status, and their resources:condor_status condor_status -serverObviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:
![]() Rank = MipsWith all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file ![]() Requirements = (Mips > 2000)This would restrict your job to the fastest processors on the cluster. All the machines on the batch farm are not the sameThe batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools![]() "Why isn't my job running on all the machines in the batch farm?"You didn't read the previous section, did you? Here it is again: Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler![]() # If you're using bash: shopt -s expand_aliases source /usr/nevis/adm/nevis-init.sh setup root geant4Finally, don't forget to set initialdir in your condor submit file.
Extra disk space | |||||||
In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount![]() | ||||||||
Line: 32 to 89 | ||||||||
Submitting batch jobs | ||||||||
Changed: | ||||||||
< < | The batch job submission system we're using at Nevis is Condor![]() ![]() | |||||||
> > | The batch job submission system we're using at Nevis is Condor![]() ![]() | |||||||
To use Condor at Nevis, the simplest way is to use the setup![]() | ||||||||
Line: 40 to 97 | ||||||||
This will set the variable $CONDOR_CONFIG to ~condor/etc/condor_config , and add ~condor/bin to your $PATH . | ||||||||
Deleted: | ||||||||
< < | What processing power is availableThe following commands will show you the machines available to run your jobs, their status, and their resources:condor_status condor_status -serverObviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:
![]() Rank = MipsWith all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file ![]() Requirements = (Mips > 2000)This would restrict your job to the fastest processors on the cluster. | |||||||
Use the vanilla environmentWe have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: | ||||||||
Changed: | ||||||||
< < | Unless you've specifically used the condor_compile![]() | |||||||
> > | Unless you've specifically used the condor_compile![]() | |||||||
universe = vanilla | ||||||||
Line: 79 to 113 | ||||||||
initialdir = /a/data/tanya/seligman/kit/TestArea/ | ||||||||
Added: | ||||||||
> > | Important: Think about how you handle your job's filesPicture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once. Sounds like a recipe for disaster, doesn't it? You can crash your server by writing 200 files at once via NFS. It's happened several times at Nevis. To solve this problem, don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual![]() should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = ...list of input files...This will transfer your input files to the condor master server once, instead of 200 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the directory from which you submitted the job. Result: you don't crash your server. You also don't clog up the network with unnecessary file transfers. | |||||||
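As a concrete but hypothetical illustration of the three transfer commands above, here is how they might sit inside a complete vanilla-universe submit file; the executable, input file names, and queue count are placeholders:
universe                = vanilla
executable              = run_myjob.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = config.dat, geometry.root
output                  = myjob.$(Process).out
error                   = myjob.$(Process).err
log                     = myjob.log
notification            = Error
queue 200
Each job then reads and writes on the local disk of the machine that runs it, and condor copies the output back to the directory you submitted from when the job exits.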
sh-style shells versus csh-style shellsThere appears to be a difference in the way the sh![]() ![]() sh , bash , or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh , the scripts will fail due to "file not found" errors unless you do one of the following: | ||||||||
Line: 91 to 142 | ||||||||
executable = athena.cshbut the following succeeds: executable = /a/data/tanya/seligman/kit/TestArea/athena.csh | ||||||||
Deleted: | ||||||||
< < |
All the machines on the batch farm are not the sameThe batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora, nor do they all have the same version of gcc![]() ![]() "Why isn't my job running on all the machines in the batch farm?"You didn't read the previous section, did you? Here it is again: Not all machines in the farm are the same; they run different versions of Fedora. Make sure you use the standardized compiler![]() source /usr/nevis/adm/nevis-init.sh setup root geant4Finally, don't forget to set initialdir in your condor submit file. | |||||||
ExamplesThe standard condor examplesIf you're just starting to learn Condor, a good way to start is to copy the Condor examples: | ||||||||
Changed: | ||||||||
< < | cp -arv ~condor/condor-7.0.1/examples . | |||||||
> > | cp -arv ~condor/condor-7.2.4/examples . | |||||||
cd examples | ||||||||
Line: 123 to 157 | ||||||||
Examples that incorporate the tips on this page | ||||||||
Changed: | ||||||||
< < | Many of the above tips, and others, have been combined into a set of example scripts. They are in ~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. | |||||||
> > | Many of the above tips, and others, have been combined into a set of example scripts. The Athena-related scripts are in ~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. Note that those examples were prepared in 2005. | |||||||
Submitting multiple jobs with one
An ATLAS example: Running Multiple Jobs On Condor | ||||||||
Deleted: | ||||||||
< < | As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. | |||||||
\ No newline at end of file | ||||||||
Added: | ||||||||
> > | As of Jun-2008, you can find several examples of multiple job submission in /a/home/riverside/seligman/nusong/aria/work ; these go further with the tips in the above link, to generate both numeric and text parameters that vary according to condor's process ID. Look in the *.cmd files, which will lead you in turn to some of the *.sh files in that directory. There are hopefully enough comments in those scripts to get you started. |
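The trick those scripts rely on is condor's $(Process) macro, which runs from 0 to N-1 when you queue N jobs. A stripped-down, hypothetical illustration (the script and file names are placeholders):
executable = run_myjob.sh
arguments  = $(Process)
output     = myjob.$(Process).out
error      = myjob.$(Process).err
log        = myjob.log
queue 100
Inside run_myjob.sh, $1 is then the process number, which the script can turn into a random-number seed, an input file name, or whatever else distinguishes one job from the next.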
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
Batch Services at NevisThis is a description of the batch job submission services available on the Linux cluster![]() ![]() |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 16 to 16 | ||||||||
| ||||||||
Deleted: | ||||||||
< < |
| |||||||
For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group): | ||||||||
Changed: | ||||||||
< < | cd /a/data/condor/array3/atlas/ | |||||||
> > | cd /a/data/condor/array2/atlas/ | |||||||
mkdir $user
cd $user
# ... create whatever files you want
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 95 to 95 | ||||||||
All the machines on the batch farm are not the same | ||||||||
Changed: | ||||||||
< < | The batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora Core, nor do they all have the same version of gcc![]() ![]() | |||||||
> > | The batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora, nor do they all have the same version of gcc![]() ![]() "Why isn't my job running on all the machines in the batch farm?"You didn't read the previous section, did you? Here it is again: Not all machines in the farm are the same; they run different versions of Fedora. Make sure you use the standardized compiler![]() source /usr/nevis/adm/nevis-init.sh setup root geant4Finally, don't forget to set initialdir in your condor submit file. | |||||||
Examples |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 6 to 6 | ||||||||
Changed: | ||||||||
< < | Running Multiple Jobs On Condor![]() | |||||||
> > | Batch and disk services | |||||||
Changed: | ||||||||
< < | Batch server | |||||||
> > | The system responsible for administering batch services is condor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you. | |||||||
Changed: | ||||||||
< < | The system responsible for administering batches services is condor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They are initially to be used by the ATLAS and D0 groups, as noted below, but may be made available to other groups as the need arises. These disks are available via automount![]() | |||||||
> > | In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount![]() | |||||||
The names of these RAID arrays are: | ||||||||
Line: 20 to 18 | ||||||||
| ||||||||
Changed: | ||||||||
< < | For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group): | |||||||
> > | For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group):
cd /a/data/condor/array3/atlas/
mkdir $user
cd $user
# ... create whatever files you want
Important! If you're skimming this page, stop and read the following paragraph! | ||||||||
Changed: | ||||||||
< < | The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5![]() | |||||||
> > | The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5![]() | |||||||
One more time: the disks on these partitions are not backed up! | ||||||||
Line: 34 to 37 | ||||||||
To use Condor at Nevis, the simplest way is to use the setup![]() | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | setup condor | |||||||
This will set the variable $CONDOR_CONFIG to ~condor/etc/condor_config , and add ~condor/bin to your $PATH . | ||||||||
Changed: | ||||||||
< < | Condor tips and tricksExamplesIf you're just starting to learn Condor (as I am), a good start is to copy the Condor examples:make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below. | |||||||
> > | What processing power is available | |||||||
Changed: | ||||||||
< < | What processing power is available | |||||||
> > | The following commands will show you the machines available to run your jobs, their status, and their resources: | |||||||
Changed: | ||||||||
< < | The following commands will show you the machines available to run your jobs, their status, and their resources: | |||||||
> > | condor_status condor_status -server | |||||||
Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider: | ||||||||
Line: 60 to 56 | ||||||||
| ||||||||
Changed: | ||||||||
< < | The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank![]() | |||||||
> > | The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank![]() Rank = Mips | |||||||
Changed: | ||||||||
< < | With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file![]() | |||||||
> > | With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file![]() Requirements = (Mips > 2000) | |||||||
This would restrict your job to the fastest processors on the cluster. | ||||||||
Changed: | ||||||||
< < | Use the vanilla environment | |||||||
> > | Use the vanilla environment | |||||||
We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful:
Unless you've specifically used the condor_compile![]() | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | universe = vanilla | |||||||
Changed: | ||||||||
< < | Handling disk files | |||||||
> > | Handling disk files | |||||||
As you read through the User's Manual![]() ![]() ![]() | ||||||||
Changed: | ||||||||
< < | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
> > | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
Changed: | ||||||||
< < | sh-style shells versus csh-style shells | |||||||
> > | initialdir = /a/data/tanya/seligman/kit/TestArea/ | |||||||
Changed: | ||||||||
< < | There appears to be a difference in the way the [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=sh][sh] and csh![]() sh , bash , or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh , the scripts will fail due to "file not found" errors unless you do one of the following: | |||||||
> > | sh-style shells versus csh-style shells | |||||||
Changed: | ||||||||
< < |
| |||||||
> > | There appears to be a difference in the way the sh![]() ![]() sh , bash , or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh , the scripts will fail due to "file not found" errors unless you do one of the following: | |||||||
Changed: | ||||||||
< < |
| |||||||
> > |
should_transfer_files = YES when_to_transfer_output = ON_EXIT | |||||||
Changed: | ||||||||
< < | All the machines on the batch farm are not the same | |||||||
> > |
executable = athena.csh
but the following succeeds:
executable = /a/data/tanya/seligman/kit/TestArea/athena.csh
All the machines on the batch farm are not the same | |||||||
The batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora Core, nor do they all have the same version of gcc![]() ![]() | ||||||||
Changed: | ||||||||
< < | More examples | |||||||
> > | ExamplesThe standard condor examplesIf you're just starting to learn Condor, a good way to start is to copy the Condor examples:cp -arv ~condor/condor-7.0.1/examples . cd examplesRead the README file; type make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.
Examples that incorporate the tips on this page | |||||||
Many of the above tips, and others, have been combined into a set of example scripts. They are in ~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. | ||||||||
Added: | ||||||||
> > |
Submitting multiple jobs with one
An ATLAS example: Running Multiple Jobs On Condor |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 6 to 6 | ||||||||
Added: | ||||||||
> > | Running Multiple Jobs On Condor![]() | |||||||
Batch serverThe system responsible for administering batches services iscondor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you. | ||||||||
Changed: | ||||||||
< < |
In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be
shared among the users of Nevis batch system. They are initially to be used by the ATLAS and D0 groups, as noted below, but may be made available to other groups as the need arises. These disks are available via automount | |||||||
> > |
In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of Nevis batch system. They are initially to be used by the ATLAS and D0 groups, as noted below, but may be made available to other groups as the need arises. These disks are available via automount![]() | |||||||
The names of these RAID arrays are: | ||||||||
Line: 19 to 20 | ||||||||
| ||||||||
Changed: | ||||||||
< < | For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group):
| |||||||
> > | For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group): | |||||||
Important! If you're skimming this page, stop and read the following paragraph! | ||||||||
Changed: | ||||||||
< < | The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5![]() | |||||||
> > | The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5![]() | |||||||
One more time: the disks on these partitions are not backed up! | ||||||||
Line: 39 to 34 | ||||||||
To use Condor at Nevis, the simplest way is to use the setup![]() | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | ||||||||
This will set the variable $CONDOR_CONFIG to ~condor/etc/condor_config , and add ~condor/bin to your $PATH . | ||||||||
Line: 49 to 42 | ||||||||
Examples | ||||||||
Changed: | ||||||||
< < | If you're just starting to learn Condor (as I am), a good start is to copy the Condor examples:
| |||||||
> > | If you're just starting to learn Condor (as I am), a good start is to copy the Condor examples: | |||||||
Read the README file; type make to compile the programs; type sh submit to submit a few test jobs. | ||||||||
Line: 61 to 50 | ||||||||
What processing power is available | ||||||||
Changed: | ||||||||
< < | The following commands will show you the machines available to run your jobs, their status, and their resources:
| |||||||
> > | The following commands will show you the machines available to run your jobs, their status, and their resources: | |||||||
Changed: | ||||||||
< < | Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider: | |||||||
> > | Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider: | |||||||
| ||||||||
Line: 76 to 60 | ||||||||
| ||||||||
Changed: | ||||||||
< < | The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank![]() ![]() | |||||||
> > | The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank![]() ![]() | |||||||
This would restrict your job to the fastest processors on the cluster.
Use the vanilla environment | ||||||||
Changed: | ||||||||
< < | We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: | |||||||
> > | We have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: | |||||||
Unless you've specifically used the condor_compile![]() | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | ||||||||
Handling disk files | ||||||||
Changed: | ||||||||
< < | As you read through the [[http://www.cs.wisc.edu/condor/manual/v7.0/2_Users_Manual.html][User's
Manual]] chapter on job submission![]() ![]() | |||||||
> > | As you read through the User's Manual![]() ![]() ![]() | |||||||
Changed: | ||||||||
< < | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory:
| |||||||
> > | Because we use a shared file system![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory: | |||||||
sh-style shells versus csh-style shells | ||||||||
Changed: | ||||||||
< < | There appears to be a difference in the way the [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=sh][sh] and
csh![]() sh , bash , or
zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh ,
the scripts will fail due to "file not found" errors unless you do one of the following:
| |||||||
> > | There appears to be a difference in the way the [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=sh][sh] and csh![]() sh , bash , or zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh , the scripts will fail due to "file not found" errors unless you do one of the following:
| |||||||
All the machines on the batch farm are not the same | ||||||||
Changed: | ||||||||
< < | The batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora Core, nor do they
all have the same version of gcc![]() ![]() | |||||||
> > | The batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora Core, nor do they all have the same version of gcc![]() ![]() | |||||||
More examples | ||||||||
Changed: | ||||||||
< < | Many of the above tips, and others, have been combined into a set of example scripts. They are in ~seligman/condor/ ; start with
the README file, which will point you to the other relevant files in the directory. | |||||||
> > | Many of the above tips, and others, have been combined into a set of example scripts. They are in ~seligman/condor/ ; start with the README file, which will point you to the other relevant files in the directory. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
Batch Services at NevisThis is a description of the batch job submission services available on the Linux cluster![]() ![]() Batch serverThe system responsible for administering batches services iscondor.nevis.columbia.edu . Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be
shared among the users of Nevis batch system. They are initially to be used by the ATLAS and D0 groups, as noted below, but may be made available to other groups as the need arises. These disks are available via automount
![]() Submitting batch jobsThe batch job submission system we're using at Nevis is Condor![]() ![]() ![]() $CONDOR_CONFIG to ~condor/etc/condor_config , and add ~condor/bin to your $PATH .
Condor tips and tricksExamplesIf you're just starting to learn Condor (as I am), a good start is to copy the Condor examples:make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.
What processing power is availableThe following commands will show you the machines available to run your jobs, their status, and their resources:
![]() ![]() Use the vanilla environmentWe have discovered that the vanilla environment described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful: Unless you've specifically used the condor_compile![]() Handling disk filesAs you read through the [[http://www.cs.wisc.edu/condor/manual/v7.0/2_Users_Manual.html][User's Manual]] chapter on job submission![]() ![]() ![]() ![]() ![]() /a/data/tanya/seligman/kit/TestArea/ , I include the following line in my command script to make sure the executing machine has correctly mounted the directory:
sh-style shells versus csh-style shellsThere appears to be a difference in the way the [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=sh][sh] and csh![]() sh , bash , or
zsh (the default at Nevis) the examples in the Condor manual basically work as they are. In csh or tcsh ,
the scripts will fail due to "file not found" errors unless you do one of the following:
All the machines on the batch farm are not the sameThe batch farm is a heterogenous collection of machines; that is, they're not all running the same version of Fedora Core, nor do they all have the same version of gcc![]() ![]() More examplesMany of the above tips, and others, have been combined into a set of example scripts. They are in~seligman/condor/ ; start with
the README file, which will point you to the other relevant files in the directory. |