Condor Best Practices

Please feel free to contact us with questions about any of the following items.

Submitting

  • Do not schedule Condor jobs with cron, or with a script that sleeps between submissions. If you need to sequence submissions, use DAGMan (see the sketch after this list).
  • Shorter individual jobs are better, down to about 10-15 minutes each; below that, per-job scheduling overhead starts to dominate. Avoid jobs that take days to run, since less work is lost when a short job is preempted.
  • If you are regularly submitting more than 5000 jobs at a time, please contact us.
  • If you are running something new, submit only a few jobs as a test first.
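
For example, a minimal DAG file along these lines (the job names and submit file names are placeholders, not an existing setup) runs jobB only after jobA completes:

# sequence.dag: run jobA first, then jobB after jobA finishes successfully
JOB jobA jobA.sub
JOB jobB jobB.sub
PARENT jobA CHILD jobB

Submit it with condor_submit_dag sequence.dag, and DAGMan handles the sequencing for you.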

Data

  • Do not run Condor jobs from your home directory. Contact us to set up a computational disk space for you.
  • Break your data files into smaller pieces if possible. The built-in Condor file transfer facility works well for files up to about 20 MB. Congestion problems appear above that size, and at even smaller sizes when you submit many jobs at once.
  • The /scratch directory is available on condor compute nodes.
    • Do not put files directly into /scratch; create a subdirectory for your job (see the example script after this list).
    • Files in /scratch are not backed up.
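
As a sketch of the /scratch pattern (all paths and file names here are illustrative, not an existing setup), a wrapper script run by your job might look like:

#!/bin/sh
# Work in a per-user, per-job subdirectory of /scratch.
WORKDIR=/scratch/$USER/myjob.$$
mkdir -p "$WORKDIR"
cd "$WORKDIR"
# Stage input from your computational disk space.
cp /z/Comp/mycondordir/input.dat .
# ... run your program here ...
# Copy results back; remember that /scratch is not backed up.
cp results.dat /z/Comp/mycondordir/
rm -rf "$WORKDIR"

Keeping each job in its own subdirectory separates your files from other users' and makes cleanup easy.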

Recipes

R

When running R jobs under Condor, be sure to include the --vanilla option, which keeps R from reading or saving workspaces and startup files that could interfere with a batch run.

universe   = vanilla
executable = /s/bin/R
initialdir = /z/Comp/mycondordir
log        = /z/Comp/mycondordir/log

arguments  = --vanilla ...

queue
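
Save the submit description in a file and hand it to condor_submit (the file name here is just an example):

condor_submit r-job.sub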

Automatic Job Restarts

We have more than 200 compute nodes in our Condor pool (as of Dec 2016). Occasionally one of those machines develops a problem that causes Condor jobs to fail seemingly at random. Adding the following to your Condor submit file causes a failing job to be retried a few times, giving it a chance to land on a better-behaved host.

# -----
# Send the job to Held state on failure.
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the job every 10 minutes, for up to 5 total starts
# (the initial run plus 4 retries).
periodic_release = (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 600)
# -----
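
If a job uses up its retries, it stays in the Held state. You can see the hold reason and release a job by hand with the standard Condor tools (the cluster ID below is a placeholder):

condor_q -hold
condor_release <cluster_id>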