Condor Best Practices
Please feel free to contact us with questions about any of the following items.
- Do not schedule condor jobs with cron, and do not use a script that sleeps to submit jobs on a schedule. If you need to sequence submissions, use DAGMan.
- The faster individual jobs run, the better, down to about 10-15 minutes each. Avoid jobs that take days to run: the shorter the job, the less work you lose if it is preempted.
- If you are regularly submitting more than 5000 jobs at a time, please contact us.
- If you are running something new, submit only a few jobs as a test first.
- Do not run condor jobs from your home directory. Contact us to set up a computational disk space for you.
- Break your data files into smaller pieces if possible. The built-in condor file transfer facility works well for files up to about 20M. Above that size congestion problems appear, and they can appear at even smaller file sizes if you are submitting many jobs.
- The /scratch directory is available on condor compute nodes.
- Do not put files directly into /scratch itself; create a subdirectory there for your job.
- Files in /scratch are not backed up.
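For sequencing submissions, DAGMan lets you declare jobs and their ordering in a small DAG file. A minimal sketch follows; the file names (diamond.dag, a.sub, and so on) are hypothetical placeholders for your own submit files.

```
# diamond.dag -- hypothetical example: B and C run after A finishes, D runs last
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B C
PARENT B C CHILD D
```

Submit it with `condor_submit_dag diamond.dag`; DAGMan then handles the ordering for you instead of cron or sleep loops.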
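A job's use of /scratch can be sketched as a small wrapper script. This is a sketch only: the variable names are assumptions, and it defaults to /tmp so it can run anywhere; on the compute nodes the scratch area is /scratch.

```shell
# Sketch of a per-job scratch workflow (names here are assumptions).
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp}"               # substitute /scratch on the compute nodes
JOBDIR="$SCRATCH_ROOT/${USER:-$(id -un)}/condor_job_$$"

mkdir -p "$JOBDIR"                                 # a subdirectory for this job, never /scratch itself
echo "working in $JOBDIR"

# ... run the real work in $JOBDIR and copy results back to your submit directory ...

rm -rf "$JOBDIR"                                   # clean up; /scratch is not backed up
```

Cleaning up at the end matters because files in /scratch are neither backed up nor removed for you.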
When running R jobs under condor, be sure to include the --vanilla option, as in this example submit file:
Universe   = vanilla
Executable = /s/bin/R
initialdir = /z/Comp/mycondordir
log        = /z/Comp/mycondordir/log
arguments  = --vanilla ...
Automatic Job Restarts
We have more than 200 compute nodes in our Condor pool (as of Dec 2016). Occasionally one of those machines has a problem, which can cause Condor jobs to fail seemingly at random. Adding the following to your Condor submit file will retry a failing job a few times, giving it a chance to land on a better-behaved host.
# -----
# Send the job to Held state on failure.
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the job every 10 minutes, up to a maximum of 5 retries.
periodic_release = (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 600)
# -----