Condor Best Practices
Please feel free to contact us with questions about any of the following items.
- Do not schedule Condor jobs with cron, or with a script that uses sleep to submit jobs on a schedule. If you need to sequence submissions, use DAGMan.
- Shorter jobs are better, down to roughly 10 to 60 minutes each. Avoid jobs that take days to run: the shorter the job, the less work you lose if it is preempted.
- If you are regularly submitting more than 5000 jobs at a time, please contact us.
- If you are running something new, submit only a few jobs as a test first.
- Do not run Condor jobs from your home directory. Contact us to set up computational disk space for you.
- Break your data files into smaller pieces if possible. The built-in Condor file transfer facility works acceptably for files up to about 20 MB. Congestion problems appear above that size, and at even smaller sizes if you are submitting many jobs.
- The /scratch directory is available on Condor compute nodes.
- Do not put files directly into /scratch; create a subdirectory for your job and keep its files there.
- Files in /scratch are not backed up.
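The /scratch advice above can be sketched in Python as follows. This is only a minimal illustration, assuming your job is driven by a Python wrapper; the function name job_scratch_dir and the prefix "myjob-" are hypothetical and not part of any site tooling.

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def job_scratch_dir(base="/scratch", prefix="myjob-"):
    """Create a private, uniquely named subdirectory under `base`.

    The directory is removed when the block exits, so nothing is left
    behind on the compute node. `base` and `prefix` are illustrative
    defaults (assumptions), not site requirements.
    """
    path = tempfile.mkdtemp(prefix=prefix, dir=base)
    try:
        yield path
    finally:
        # Files in /scratch are not backed up; copy results elsewhere
        # before leaving the block, then clean up.
        shutil.rmtree(path, ignore_errors=True)


# Typical use inside a job wrapper (not executed here):
#
#     with job_scratch_dir() as workdir:
#         ...write temporary files under workdir...
```

Because the directory name is generated by tempfile.mkdtemp, two jobs landing on the same node should not collide with each other's files.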
Limiting How Many Jobs Run
If your Condor jobs are very I/O intensive (that is, they read or write a lot of data), we will ask you to limit how many jobs you run concurrently. Otherwise, your jobs will slow down the file servers and cause trouble for other users.
Condor supports concurrency limits on abstract, named resources. The BCG default is for any named resource to have a limit of 1000 units. By using your user name as the resource name, you can easily limit your own jobs: divide 1000 by the number of jobs you want to run at once, and request that many units per job. For example, to run only 50 copies of your job at a time, 1000/50 = 20, so each job should ask for 20 units. To do that, put this in your submit file (substitute your user name for annis):
concurrency_limits = annis:20 # use your user name, not 'annis'
This limits your jobs to 50 concurrent runs across the Condor pool.
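The arithmetic can be sketched as a small Python helper. This is an illustration only: the function names are hypothetical, and rounding up when 1000 does not divide evenly is an assumption made here so that the number of running jobs never exceeds the target.

```python
def concurrency_units(max_concurrent_jobs, resource_limit=1000):
    """Units each job should request so that at most
    `max_concurrent_jobs` run at once under `resource_limit` units.

    Rounds up (an assumption): rounding down could allow more jobs
    than intended, e.g. 1000 // 300 = 3 units would permit 333 jobs.
    """
    if max_concurrent_jobs < 1:
        raise ValueError("max_concurrent_jobs must be at least 1")
    # Integer ceiling division, avoiding floating point.
    return -(-resource_limit // max_concurrent_jobs)


def concurrency_limits_line(user, max_concurrent_jobs, resource_limit=1000):
    """Build the submit-file line for the given user name."""
    units = concurrency_units(max_concurrent_jobs, resource_limit)
    return f"concurrency_limits = {user}:{units}"


print(concurrency_limits_line("annis", 50))
```

For the example in the text (50 jobs, default limit of 1000), this yields 20 units per job.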
When running R jobs in Condor, be sure to pass the --vanilla option to R in the arguments line of your submit file:
Universe = vanilla
Executable = /s/bin/R
initialdir = /z/Comp/mycondordir
log = /z/Comp/mycondordir/log
arguments = --vanilla ...
Automatic Job Restarts
We have more than 200 compute nodes in our Condor pool (as of Dec 2016). Occasionally one of those machines has a problem, which can cause Condor jobs to fail seemingly at random. Adding the following to your Condor submit file causes a failing job to be retried a few times, giving it a chance to land on a better-behaved host.
# -----
# Send the job to Held state on failure.
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)
# Periodically retry the jobs every 10 minutes, up to a maximum of 5 retries.
periodic_release = (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 600)
# -----
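To make the two expressions easier to reason about, here is a plain Python model of their logic. It is only a sketch of when each expression evaluates to true, not how Condor evaluates them internally; the function and parameter names (max_starts, min_held_seconds) are hypothetical.

```python
def on_exit_hold(exit_by_signal, exit_code):
    """Mirror of: (ExitBySignal == True) || (ExitCode != 0).

    True means the job would be put on hold when it exits.
    """
    return exit_by_signal or exit_code != 0


def periodic_release(num_job_starts, current_time, entered_current_status,
                     max_starts=5, min_held_seconds=600):
    """Mirror of: (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 600).

    True means a held job would be released to run again.
    Times are in seconds, like the attributes in the submit file.
    """
    held_for = current_time - entered_current_status
    return num_job_starts < max_starts and held_for > min_held_seconds


# A job that exited with code 1 is held; after more than 600 seconds
# on hold, and with fewer than 5 starts so far, it is released.
print(on_exit_hold(False, 1))            # job failed with a nonzero exit code
print(periodic_release(1, 1601, 1000))   # held for 601 seconds, 1 start so far
```

Read this way, a job whose NumJobStarts has reached 5 stays in the Held state, and a clean exit (no signal, exit code 0) is never held in the first place.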