Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Batch Services at Nevis | ||||||||
Line: 12 to 12 | ||||||||
The system responsible for administering batch services is condor.nevis.columbia.edu. Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of condor.nevis.columbia.edu may be completely transparent to you. | ||||||||
Added: | ||||||||
> > | To use any of the condor commands given below, you have to set it up:
setup condor
Condor status and usage
The CondorView | |||||||
Fair use
The condor system is most efficient when it's handling a large number of small jobs. Long jobs tend to clog up the queues, and prevent others from doing their work. | ||||||||
Line: 24 to 32 | ||||||||
If you use the vanilla environment (see below), as most users at Nevis must, for a job to be "pre-empted" means that it is suspended until that same machine has an available queue. | ||||||||
Added: | ||||||||
> > | To get an idea of your user resource consumption and how it compares to other users, use these commands:
condor_userprio -allusers
The larger the number, the lower your priority in comparison to the other users listed. | |||||||
What processing power is available
The following commands will show you the machines available to run your jobs, their status, and their resources: | ||||||||
Line: 53 to 67 | ||||||||
"Why isn't my job running on all the machines in the batch farm?" | ||||||||
Changed: | ||||||||
< < | You didn't read the previous section, did you? | |||||||
> > | There may be several reasons:
The heterogeneous cluster | |||||||
Changed: | ||||||||
< < | Here it is again: Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler | |||||||
> > | Not all machines in the farm are the same; they have different amounts of memory, disk space, and occasionally even installed libraries. Make sure you use the standardized compiler | |||||||
You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the example described below, you'll see that the shell scripts all contain commands such as:
# If you're using bash: | ||||||||
Line: 65 to 81 | ||||||||
Finally, don't forget to set initialdir in your condor submit file. | ||||||||
Added: | ||||||||
> > | The job requirements
There may be something explicit or implicit in the resources required to run your job. To pick an unrealistic example, if your job requires ksh and that shell isn't installed on a machine, then it won't execute on that machine. A more practical example: If you have the following in your job submit file:
Requirements = ( Memory > 1024)
then your job won't execute on any machine where the amount of memory per job queue is 1024 MB or less. That includes machines with 1023 MB per queue, a value that results from rounding in the memory calculation. If you think your job with ID 4402 should be able to execute on machine batch04, you can compare what condor thinks are the job's requirements against what the machine offers:
condor_q -long -global 4402
condor_status -long batch04
Suspended jobs
As noted elsewhere on this page, we generally use the vanilla universe at Nevis. This means if a job is suspended on a given machine, it can only continue on that particular machine. If that machine is running other jobs, then the suspended job must wait. | |||||||
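The -long output of condor_q and condor_status can run to many lines, so when comparing a job's requirements against a machine it may help to filter it. The following is only a sketch: match_attrs is a hypothetical helper, not a condor command, and the attribute names it keeps (Requirements, Memory, Disk, Arch, OpSys) are an assumption about which ones matter for your job.

```shell
#!/bin/sh
# Hypothetical helper (not part of condor): keep only the attributes
# that commonly decide whether a job matches a machine.
# It reads "Attribute = value" lines on standard input, which is the
# layout printed by the -long option.
match_attrs() {
    grep -E '^(Requirements|Memory|Disk|Arch|OpSys) = '
}

# Usage, with the job ID and machine name from the example on this page:
#   condor_q -long -global 4402 | match_attrs
#   condor_status -long batch04 | match_attrs
```

If the job's Requirements line mentions an attribute that is not in the list, add its name to the pattern.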
Extra disk space
In addition to any RAID drives attached to your workgroup's servers, there are additional "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via automount | ||||||||
Line: 123 to 157 | ||||||||
initialdir = /a/data/tanya/seligman/kit/TestArea/ | ||||||||
Changed: | ||||||||
< < | Important: Think about how you handle your job's files | |||||||
> > | Important: Think about how you transfer your job's files | |||||||
Picture this: You submit a condor batch procedure that runs thousands of jobs. Each one of those jobs reads and/or writes directly into a directory on your server, accessed via NFS. Let's say there are 200 batch queues in the Nevis condor system. That means that 200 jobs are trying to access the disk on your server at once. |