---+ Batch Services at Nevis

This is a description of the batch job submission services available on the [[http://www.nevis.columbia.edu/linux/][Linux cluster]] at [[http://www.nevis.columbia.edu][Nevis Labs]].

%TOC%

---++ Batch and disk services

The system responsible for administering batch services is ==condor.nevis.columbia.edu==. Users typically cannot log in to this machine; you submit and monitor jobs from your own box on the Linux cluster. As far as job submission and execution are concerned, the existence of ==condor.nevis.columbia.edu== may be completely transparent to you.

In addition to any RAID drives attached to your workgroup's servers, there are "common" RAID drives intended to be shared among the users of the Nevis batch system. They were initially used by the ATLAS and D0 groups, but can be made available to other groups as the need arises. These disks are available via [[http://www.nevis.columbia.edu/linux/automount.html][automount]] on the Linux cluster; each has a capacity of about 1.5TB. The names of these RAID arrays are:

   * =/a/data/condor/array1/=
   * =/a/data/condor/array2/=

For example, the permissions on the drives have been set so that, if you're a member of the ATLAS group, you can do the following from any machine on the Linux cluster:

<verbatim>
cd /a/data/condor/array2/atlas/
mkdir $user
cd $user
# ... create whatever files you want
</verbatim>

*Important! If you're skimming this page, stop and read the following paragraph!*

The files on these =/data= partitions, like those on the =/data= partitions of any other systems on the Nevis cluster, are not backed up. They are stored on [[http://www.acnc.com/04_01_05.html][RAID5]] arrays, which are a reliable form of storage, and monitoring software warns us if any individual drive fails. However, RAID arrays have been known to fail (and we've had at least one such failure at Nevis).
If you have any critical data stored on these drives, make sure you back up the files yourself. One more time: the disks on these partitions *are not backed up!*

---++ Submitting batch jobs

The batch job submission system we use at Nevis is [[http://www.cs.wisc.edu/condor/][Condor]], developed at the University of Wisconsin. You can learn more about Condor from the [[http://www.cs.wisc.edu/condor/manual/v7.0/2_Users_Manual.html][User's Manual]].

The simplest way to use Condor at Nevis is with the [[http://www.nevis.columbia.edu/software/local.html][setup]] command:

<verbatim>
setup condor
</verbatim>

This sets the variable ==$CONDOR_CONFIG== to ==~condor/etc/condor_config== and adds ==~condor/bin== to your ==$PATH==.

---+++ What processing power is available

The following commands will show you the machines available to run your jobs, their status, and their resources:

<verbatim>
condor_status
condor_status -server
</verbatim>

Obviously, some machines are more powerful than others. Before you decide that only the most powerful machines are good enough for your jobs, consider:

   * It's true that a machine which is 1/4 as fast will take four times as long to execute your job. However, demand for the faster machine may be more than four times as great; your job could sit waiting in the queue longer than it would have taken to run on the slower box.
   * The CPU cycles on the slower machines are presently being wasted. You might be able to put them to some use.
   * If you have a large number of jobs to submit, the slower machines can chug away at a couple of them while the rest are waiting to execute on the faster processors.
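If you want to see the speed rating Condor assigns to each machine before deciding, you can query the ==Mips== attribute directly. This is a sketch, assuming the =-format= option of =condor_status= available in Condor 7.x:

<verbatim>
# Print each machine's name alongside its Mips rating
condor_status -format "%-30s" Name -format "%d\n" Mips
</verbatim>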
The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the [[http://www.cs.wisc.edu/condor/manual/v7.0/2_5Submitting_Job.html#SECTION00352300000000000000][Rank]] attribute in your submit file:

<verbatim>
Rank = Mips
</verbatim>

With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your [[http://www.cs.wisc.edu/condor/manual/v7.0/2_5Submitting_Job.html][submit file]]:

<verbatim>
Requirements = (Mips > 2000)
</verbatim>

This would restrict your job to the fastest processors on the cluster.

---+++ Use the vanilla universe

We have discovered that the vanilla universe described in the Condor manual does not behave exactly as documented at Nevis. The following advice may be helpful:

Unless you've specifically used the [[http://www.cs.wisc.edu/condor/manual/v6.8/condor_compile.html#man-condor-compile][condor_compile]] command to compile your programs, you *must* submit your jobs in the "vanilla" universe. In particular, the Athena and D0 distribution kits do *not* use =condor_compile=, and any command script that uses those kits must contain the line:

<verbatim>
universe = vanilla
</verbatim>

---+++ Handling disk files

As you read through the [[http://www.cs.wisc.edu/condor/manual/v7.0/2_Users_Manual.html][User's Manual]] chapter on [[http://www.cs.wisc.edu/condor/manual/v7.0/2_5Submitting_Job.html][job submission]], note that we use a [[http://www.cs.wisc.edu/condor/manual/v7.0/2_5Submitting_Job.html#SECTION00353000000000000000][shared file system]] at Nevis, based on [[http://www.nevis.columbia.edu/linux/automount.html][automount]]. Because of this, it's a good idea to include the [[http://www.cs.wisc.edu/condor/manual/v7.0/condor_submit.html#35942][initialdir]] attribute in your command scripts.
For example, when I submit a script that makes use of files in my directory ==/a/data/tanya/seligman/kit/TestArea/==, I include the following line in my command script to make sure the executing machine has correctly mounted the directory:

<verbatim>
initialdir = /a/data/tanya/seligman/kit/TestArea/
</verbatim>

---+++ sh-style shells versus csh-style shells

There appears to be a difference in the way the [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=sh][sh]] and [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=csh][csh]] families of shells handle files in Condor. In ==sh==, ==bash==, or ==zsh== (the default at Nevis), the examples in the Condor manual work essentially as written. In ==csh== or ==tcsh==, the scripts will fail with "file not found" errors unless you do one of the following:

   * Transfer all working files to the executing machine with the lines: <verbatim>
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
</verbatim>
   * Include the full path name when you reference any file in your command script. For example, in my scripts, the following line fails: <verbatim>
executable = athena.csh
</verbatim> but the following succeeds: <verbatim>
executable = /a/data/tanya/seligman/kit/TestArea/athena.csh
</verbatim>

---+++ Not all the machines on the batch farm are the same

The batch farm is a heterogeneous collection of machines: they're not all running the same version of Fedora, nor do they all have the same version of [[http://www.nevis.columbia.edu/cgi-bin/man.sh?man=gcc][gcc]] installed. If your programs crash on some systems but not on others, please read this page on [[http://www.nevis.columbia.edu/software/compiler-tools.html][compiler tools]], which can help solve the problem.

---+++ "Why isn't my job running on all the machines in the batch farm?"

You didn't read the previous section, did you? Here it is again: not all machines in the farm are the same; they run different versions of Fedora.
Make sure you use the [[http://www.nevis.columbia.edu/software/compiler-tools.html][standardized compiler]] to compile your programs. You'll also want to set up the standard Nevis environment explicitly in your jobs. If you look at the examples described below, you'll see that the shell scripts all contain commands such as:

<verbatim>
source /usr/nevis/adm/nevis-init.sh
setup root geant4
</verbatim>

Finally, don't forget to set ==initialdir== in your Condor submit file.

---++ Examples

---+++ The standard Condor examples

If you're just starting to learn Condor, a good way to begin is to copy the Condor examples:

<verbatim>
cp -arv ~condor/condor-7.0.1/examples .
cd examples
</verbatim>

Read the README file; type ==make== to compile the programs; type ==sh submit== to submit a few test jobs. You may notice that the ==sh_loop== script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the ==vanilla== universe as described above.

---+++ Examples that incorporate the tips on this page

Many of the above tips, and others, have been combined into a set of example scripts in ==~seligman/condor/==. Start with the README file, which will point you to the other relevant files in the directory.

---+++ Submitting multiple jobs with one =condor_submit= command

An ATLAS example: [[http://www.nevis.columbia.edu/twiki/bin/view/ATLAS/RunningMulitpleJobsOnCondor][Running Multiple Jobs On Condor]]

As of Jun-2008, you can find several examples of multiple job submission in ==/a/home/riverside/seligman/nusong/aria/work==; these go further than the tips in the above link, generating both numeric and text parameters that vary according to Condor's process ID. Look in the ==*.cmd== files, which will lead you in turn to some of the ==*.sh== files in that directory. There are hopefully enough comments in those scripts to get you started.
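To illustrate the general idea, here is a minimal sketch of a submit file that queues several jobs with one =condor_submit= command. The file names and paths are hypothetical, not the actual files in the directory above; the =$(Process)= macro is standard Condor:

<verbatim>
universe    = vanilla
executable  = myjob.sh
# $(Process) runs from 0 to 9; use it to vary the input per job
arguments   = input_$(Process).dat
initialdir  = /a/data/condor/array1/mygroup/myjob
output      = myjob_$(Process).out
error       = myjob_$(Process).err
log         = myjob.log
queue 10
</verbatim>

Submitting this file once creates ten jobs, each with its own argument and its own output and error files.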
Topic revision: r6 - 2009-07-24 - WilliamSeligman