Line: 1 to 1

Details of running condor at Nevis
Line: 68 to 68

condor log files

Changed:
< < If you want to see the condor daemons' log files for a machine with the name hostname, look in /a/data/<hostname>/condor/log. For example:
> > If you want to see the condor daemons' log files for a machine with the name hostname, look in /nevis/<hostname>/data/condor/log. For example:

Changed:
< < # ls -blrth /a/data/karthur/condor/log
> > # ls -blrth /nevis/tehanu/data/condor/log

-rw-r--r-- 1 condor condor 153  2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root   root   591K 2010-04-13 16:29 MasterLog
Line: 33 to 33

Nevis software initialization

Changed:
< < The Nevis environment modules command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job and you're using setup:
> > The Nevis environment modules command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job and you're using module load:

source /usr/nevis/adm/nevis-init.sh
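A minimal job script built around that line might look like the following sketch. The guard around the source line is only an illustration aid so the fragment can run on machines without the Nevis software tree (on the batch farm the file always exists), and `root` stands in for whatever modules your job actually needs:

```shell
#!/bin/sh
# Sketch of a condor batch-job script that initializes the Nevis environment.
# The existence check is only so this sketch runs anywhere; on the cluster
# /usr/nevis/adm/nevis-init.sh is always present.
if [ -f /usr/nevis/adm/nevis-init.sh ]; then
    . /usr/nevis/adm/nevis-init.sh
    module load root          # load whatever tools your job needs
fi

MSG="job running on $(hostname)"
echo "$MSG"
```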
Line: 46 to 46

Memory limits

Changed:
< < The systems on the condor batch cluster have enough RAM for 1GB/processing queue. This means if your job uses more than 1GB of memory, there can be a problem. For example, if your job requires 2GB of memory and a condor batch node has 16 queues, then your 16 jobs will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt.
> > Many systems on the condor batch cluster have only enough RAM for 1GB/processing queue. This means if your job uses more than 1GB of memory, there can be a problem. For example, if your job requires 2GB of memory and a condor batch node has 16 queues, then your 16 jobs will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt.

Changed:
< < To keep this from happening, condor will automatically cancel a job that requires more than 1GB of RAM. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, it will tend to overestimate; if a program uses shared libraries, it tends to underestimate.
> > To keep this from happening, condor will automatically cancel a job that requires more RAM than a queue has available. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, it will tend to overestimate; if a program uses shared libraries, it tends to underestimate.

Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use.
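If your job legitimately needs more than the per-queue default, recent condor releases let you declare its memory requirement in the submit file so it is only matched to queues that can supply it. A sketch, assuming your condor version supports request_memory; the numbers and file name are placeholders:

```
# Hypothetical submit-file fragment; check that your condor version
# supports request_memory before relying on it.
universe       = vanilla
executable     = myjob.sh      # placeholder
request_memory = 2048          # in MB; set this to your job's measured usage
queue
```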
Line: 148 to 148

Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. The standardized compiler can help solve this problem. You may also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:

Changed:
< < # If you're using bash:
> > # The following line is only needed if you're using bash:

shopt -s expand_aliases
source /usr/nevis/adm/nevis-init.sh

Changed:
< < setup root geant4
> > module load root geant4

Finally, don't forget to set initialdir in your condor submit file.
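As a sketch, a complete submit file using initialdir might look like this; every name and path below is a placeholder to adapt to your own area:

```
# Hypothetical condor submit file; all names and paths are placeholders.
universe   = vanilla
executable = myjob.sh
# initialdir is the working directory for the job's input and output files.
initialdir = /a/data/myserver/myusername/jobdir
output     = myjob.out
error      = myjob.err
log        = myjob.log
queue
```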
Line: 31 to 31

If you can't figure out how the above lines work, then simply don't do it.

Changed:
< <
> > Nevis software initialization

Changed:
< < The Nevis setup command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job and you're using setup:
> > The Nevis environment modules command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job and you're using setup:

Deleted:
< < shopt -s expand_aliases # This line is only necessary if you're using bash

source /usr/nevis/adm/nevis-init.sh
Line: 33 to 33

Changed:
< < The Nevis setup command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job:
> > The Nevis setup command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job and you're using setup:

shopt -s expand_aliases # This line is only necessary if you're using bash
source /usr/nevis/adm/nevis-init.sh

Line: 41 to 41

Replace .sh with .csh if you use tcsh.
Added:
> > Note: If you are using a software framework with its own copy of ROOT or Geant4, you probably don't need to do this. This includes ATLAS Athena and MicroBooNE's LArSoft. Those packages have their own setup scripts, and you should use them instead.

Submitting batch jobs

Memory limits
Line: 144 to 146

The heterogeneous cluster

Changed:
< < Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler to compile your programs.
> > Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. The standardized compiler can help solve this problem.

Changed:
< < You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:
> > You may also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:

# If you're using bash:
shopt -s expand_aliases
source /usr/nevis/adm/nevis-init.sh
Line: 136 to 136

All the machines on the batch farm are not the same

Changed:
< < The batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools.
> > The batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools that can help solve this problem.
"Why isn't my job running on all the machines in the batch farm?"

Line: 144 to 144

The heterogeneous cluster

Changed:
< < Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler
> > Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler to compile your programs.

You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:

# If you're using bash:
Line: 61 to 61

Use the vanilla environment

Unless you've specifically used the condor_compile

universe = vanilla
Line: 97 to 97

The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work.

As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme

The practical upshots of condor's default priority scheme:
Line: 126 to 126

The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank

Rank = Mips
With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:

Requirements = (Mips > 2000)

This would restrict your job to the fastest processors on the cluster.
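A sketch combining the two kinds of statement: Rank expresses a preference, while Requirements is a hard constraint. The threshold 2000 is illustrative, not a recommendation:

```
# Hypothetical submit-file fragment: prefer fast machines, refuse slow ones.
Rank         = Mips            # preference: higher-Mips machines matched first
Requirements = (Mips > 2000)   # hard constraint: never match slower machines
```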
Line: 168 to 168

condor_status -long slot1@batch04

Another clue may come from condor_q. If you have a job held with an ID of 44.20:

condor_q -analyze 44.20
Line: 162 to 162

then your job won't execute if the amount of memory per job queue is 1024 MB or less, including those machines with 1023 MB per queue due to rounding in the memory calculation.

Changed:
< < If you think your job with ID 4402 should be able to execute on machine batch04, you can compare what condor thinks are the job's requirements against what the machine offers:
> > If you think your job with ID 4402 should be able to execute on queue slot1@batch04, you can compare what condor thinks are the job's requirements against what the machine offers:

condor_q -long -global 4402
Changed:
< < condor_status -long batch04
> > condor_status -long slot1@batch04

Another clue may come from condor_q. If you have a job held with an ID of 44.20:
Line: 176 to 176

Suspended jobs

As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait.
Deleted:
< < Extra disk space

In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB. The names of these RAID arrays are:

cd /a/data/condor/array2/atlas/
mkdir $user
cd $user
# ... create whatever files you want

Important! If you're skimming this page, stop and read the following paragraph!

The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5.
Line: 35 to 35

The Nevis setup command requires initialization. When you login, this initialization is done for you; look at your ~/.profile file (~/.cshrc if you use tcsh). You have to explicitly include this line if you're submitting a batch job:

Added:
> > shopt -s expand_aliases # This line is only necessary if you're using bash

source /usr/nevis/adm/nevis-init.sh
Line: 102 to 103

Changed:
< < If you use the vanilla environment (see below), as most users at Nevis must, for a job to be "pre-empted" means that it is killed, and must start again from the beginning.
> > If you use the vanilla environment (see above), as most users at Nevis must, for a job to be "pre-empted" means that it is killed, and must start again from the beginning.

To get an idea of your user resource consumption and how it compares to other users, use these commands:
Line: 6 to 6

On this page:
Added:
> > Warning signs

If your condor script or program does any of the following, it's a warning sign that your job might crash (or worse, crash the cluster):

Home directory

Bad idea: Referring explicitly to your home directory, even to read a file. Writing a file directly to your home directory from a condor job is an even worse idea. You probably want to read from a /share partition, and write to a /data partition.

If you change the default directory in the middle of a condor script or program, you'll wreak havoc on condor's standard file-transfer commands, and might have problems with disk sharing. Stick to the directory that condor assigns you.

If you're clever, you can include lines like these in your script:
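One hedged sketch of what such lines might do: keep working in the directory condor assigned, then copy the finished output to a /data partition at the end of the job. OUTPUT_DIR and myjob.out are placeholders; substitute your group's /a/data path (the /tmp fallback is only so the sketch runs anywhere):

```shell
#!/bin/sh
# Sketch only: stay in the directory condor assigned, write output locally,
# then copy the result to a /data partition when the job finishes.
# OUTPUT_DIR is a placeholder; replace it with your group's /a/data area.
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/${USER:-condor}-output}"
mkdir -p "$OUTPUT_DIR"

# ... run your program here, writing into the current directory ...
echo "results" > myjob.out

# Copy the finished output; never write directly into your home directory.
cp myjob.out "$OUTPUT_DIR/"
```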
Submitting batch jobs

Memory limits
Line: 1 to 1

Added:
> >
Details of running condor at Nevis

On this page:
Submitting batch jobs

Memory limits

The systems on the condor batch cluster have enough RAM for 1GB/processing queue. This means if your job uses more than 1GB of memory, there can be a problem. For example, if your job requires 2GB of memory and a condor batch node has 16 queues, then your 16 jobs will require 32GB of RAM, twice as much as the machine has. The machine will start swapping memory pages continuously, and essentially halt.

To keep this from happening, condor will automatically cancel a job that requires more than 1GB of RAM. Unfortunately, condor has a problem estimating the amount of memory required by a running job: if a program uses threads, it will tend to overestimate; if a program uses shared libraries, it tends to underestimate.

Therefore, if you find that your large simulation program is being "spontaneously" canceled, look at its memory use.

Do you want 10,000 e-mails?

By default, condor will send you an e-mail message as each of your jobs completes. If you've submitted 10,000 jobs, that means 10,000 e-mails. This can clog the mail server, and make your life miserable. Therefore, the following has been made the default at Nevis:

Notification = Error

This means that condor will only send you an e-mail if there's an error while running the job. Don't override it!

Use the vanilla environment

Unless you've specifically used the condor_compile

universe = vanilla

condor log files

If you want to see the condor daemons' log files for a machine with the name hostname, look in /a/data/<hostname>/condor/log. For example:
# ls -blrth /a/data/karthur/condor/log
-rw-r--r-- 1 condor condor 153  2010-04-13 15:07 StarterLog
-rw-r--r-- 1 condor condor 473K 2010-04-13 16:29 SchedLog
-rw-r--r-- 1 root   root   591K 2010-04-13 16:29 MasterLog
-rw-r--r-- 1 root   root   788K 2010-04-13 17:15 StartLog
-rw-r--r-- 1 root   root   562K 2010-04-13 17:25 NegotiatorLog
-rw-r--r-- 1 root   root   296K 2010-04-13 17:25 CollectorLog

About the batch cluster

Batch manager

The system responsible for administering batch services on the general cluster is condor.nevis.columbia.edu. Users typically do not log in to this machine directly; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you.
Condor status and usage

You can see how much of the batch cluster is in use, and by whom:

Which cluster runs your job depends on where you issue the condor_submit command. For example, if you submit a job from a Neutrino system, it will run on the Neutrino cluster; if you submit a job from kolya or karthur, it runs on the general cluster; if you submit a job from xenia, it runs on the ATLAS cluster.
Fair use

The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work.

As of Feb-2010, there is no system that gives some groups or users higher priority than others. However, condor comes with a default scheme

condor_userprio -allusers

The larger the number, the lower your priority in comparison to the other users listed.

What processing power is available

The following commands will show you the machines available to run your jobs, their status, and their resources:

condor_status
condor_status -server

Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:
Rank = Mips

With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:

Requirements = (Mips > 2000)

This would restrict your job to the fastest processors on the cluster.

All the machines on the batch farm are not the same

The batch farm is a heterogeneous collection of machines. If you're having problems with programs crashing on some systems but not on others, please read this page on compiler tools.

"Why isn't my job running on all the machines in the batch farm?"

There may be several reasons:

The heterogeneous cluster

Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler

# If you're using bash:
shopt -s expand_aliases
source /usr/nevis/adm/nevis-init.sh
setup root geant4

Finally, don't forget to set initialdir in your condor submit file.
The job requirements

There may be something explicit or implicit in the resources required to run your job. To pick an unrealistic example, if your job requires ksh and that shell isn't installed on a machine, then it won't execute on the cluster. A more practical example: If you have the following in your job submit file:

Requirements = ( Memory > 1024 )

then your job won't execute if the amount of memory per job queue is 1024 MB or less, including those machines with 1023 MB per queue due to rounding in the memory calculation.

If you think your job with ID 4402 should be able to execute on machine batch04, you can compare what condor thinks are the job's requirements against what the machine offers:
condor_q -long -global 4402
condor_status -long batch04

Another clue may come from condor_q. If you have a job held with an ID of 44.20:
condor_q -analyze 44.20

Suspended jobs

As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait.

Extra disk space

In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB. The names of these RAID arrays are:

cd /a/data/condor/array2/atlas/
mkdir $user
cd $user
# ... create whatever files you want

Important! If you're skimming this page, stop and read the following paragraph!

The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5.