Difference: DiskSharing (4 vs. 5)

Revision 5 2013-01-16 - WilliamSeligman

Line: 1 to 1
 
META TOPICPARENT name="LinuxCluster"

Disk sharing and the condor batch system

Line: 61 to 61
 

Let condor transfer the files

Don't rely on your job reading and writing files directly to a particular directory. Use commands like the following in your condor command file; look them up in the User's Manual:
 
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Line: 69 to 69
 initialdir = ...where your inputs and outputs are located...
This will transfer your input files to the condor master server once, instead of 300 times; as your job is executing, it will write the output on a local area of the machine that's running the job. Once the job has finished executing, it will transfer the output file to the initialdir directory.
 
Unless you specify a file's full pathname in the condor command file, the files will be copied to and from initialdir (see below).
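As a rough illustration of how these commands fit together, here is a minimal sketch of a condor command file. The executable name, the input file name, and the use of transfer_input_files are assumptions made for the example, not taken from this page; the initialdir placeholder is left as written above, and the count of 300 matches the example in the text.

```
# Sketch only: "mySimulation" and "input.dat" are hypothetical names.
executable              = mySimulation

# Let condor copy files to and from the machine that runs the job.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# A relative name (no full pathname) is looked up in initialdir.
transfer_input_files    = input.dat
initialdir              = ...where your inputs and outputs are located...

queue 300
```

Because input.dat is given without a full pathname, it is taken from initialdir, and the output is copied back to initialdir when each job exits.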
  There's one more thing to think about: If initialdir is located on machine A (via automount), and you submit the job from machine B, then condor will use machine B to transfer the files to the directory on machine A. For example, if these lines are in your condor submit file:
initialdir = /a/data/kolya/jsmith
queue 10000
and you submit the job on the machine karthur, then as each of the 10,000 jobs terminates karthur will automount /a/data/kolya/jsmith on kolya to write the file; see condor_shadow for more information. This has not yet caused any machines at Nevis to crash, but it has caused both machines to become annoyingly slow.
 

Plan where your log files will go

Line: 89 to 89
 log = mySimulation-$(Process).log
These lines define where the job's output, error, and log files are written. If you submit one job, the above lines are fine. If you submit 10,000 jobs, you'll create 30,000 files. If you submit mySimulation1, mySimulation2, ... you'll create an indefinite number of files. If initialdir points to a sub-directory of /share, sooner or later you'll fill up that disk. Everyone in your group can be affected.
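To make the file count concrete, here is a sketch of the kind of lines being described. Only the log line appears in this revision; the output and error file names are assumed here for illustration, following the same naming pattern.

```
# Sketch only: the .out and .err names are assumptions.
output = mySimulation-$(Process).out
error  = mySimulation-$(Process).err
log    = mySimulation-$(Process).log
queue 10000

# $(Process) runs from 0 to 9999, so each job gets its own three files:
# 10,000 jobs x 3 files per job = 30,000 files in initialdir.
```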
  The general solution is to not write your output files into your /share directory. You can do this via one of the following:
  • submit your job from a directory on the /data partition; e.g., /a/data/<file-server>/<$USER>/
  • use the initialdir command to tell condor where inputs and outputs are located.
  Specifying the complete automount path is not a good idea for log files, because then every job will try to write to that directory and you run the risk of crashing the server. Do not do the following:
 