Running Condor Jobs

Condor is a batch processing system for Unix designed to seek out machines which are idle and scavenge those wasted CPU cycles. This allows desktop workstations which are commonly unused in the evening and on weekends and holidays to be matched with people who have a need for compute power for batch jobs. Of course, dedicated compute servers are useful members of a Condor pool, too. Condor is configured so that interactive use of a computer has precedence over Condor batch jobs. If Condor is using your (otherwise idle) desktop machine and you start typing, Condor will immediately stop using your machine. See below for more details on how Condor works.

Using Condor

There are three steps to using Condor:

  1. Prepare your batch job for Condor
  2. Prepare your submit description file
  3. Submit the job

Finally, in some circumstances you’ll need to make more sophisticated changes to how your Condor jobs run. In particular, we may ask you to throttle back when you submit hundreds or thousands of jobs.

Preparing a Condor submission

Before submitting a batch job to Condor, you have to make sure you actually need Condor capabilities. For example, interactive, graphical programs will not run with Condor. Or, if you only have a single job which will take several days to run, you might as well run it in the background on your own machine by hand:

$ nohup do_a_lot_of_work > work_log &

In particular, very long-running jobs — more than about three days — which run in the Vanilla Universe (see next section) may never actually complete under Condor, and should be run by hand. If, on the other hand, you have to run the same program on multiple data sets, or if you can partition the problem into several independent programs, then you can profit from using Condor.

Vanilla vs. Standard Universes

The Biostatistics Condor pool is configured to offer three sorts of batch “universes”: the Standard Universe, the Vanilla Universe and the Java Universe. It is important to know which universe you’re going to use before preparing your batch submission file.

The Vanilla Universe can handle any batch job (Splus, R, SAS, etc.) and is what most users will use. The Standard Universe should be used for C, C++ and FORTRAN because it has the most potential to speed up jobs. Checkpointing, which lets Condor move a running job to a less busy machine without losing its progress, is only available in the Standard Universe. If you wish to use the Standard Universe you will need to recompile your C or FORTRAN program with special Condor libraries. In general, this is quite straightforward.

To recompile a program to work with Condor, you simply change your compile line to start with the command condor_compile:

# Instead of this:
$ gcc -o harmonics harmonics.c -lgmp
# Use this:
$ condor_compile gcc -o harmonics harmonics.c -lgmp

This will handle all the necessary Condor library linking. You cannot run a program compiled this way outside Condor. You need to log on to the Condor system for condor_compile to work.

Creating a submit description file

The submit description file is the file you use to tell Condor what you want it to do. I cover only the basics here, so you may wish to see the submission section of the user’s manual for the gritty details. Note that in a submit description file, comments are indicated with a pound sign (#). Also, these files contain the line:

Universe = vanilla

To submit a Standard Universe job, change the line to:

Universe = standard

Here is a single job submission file:

# Run a code test in the Vanilla Universe.

# This is our Universe.
Universe = vanilla

# This is the name of the program you want Condor to run.
Executable = testone

# This is the directory where you want the command to run.
Initialdir = /z/Proj/annis/condor

# This is highly recommended, and will help the system group keep
# track of and find errors
Log = /z/Proj/annis/condor/log

Arguments  = some_argument
# All output goes into this file.
Output     = out
# All errors go into this file.
Error      = err

# Add the job to the Condor queue.
Queue

This should be considered the simplest useful submit description file. It says that we’re using the Vanilla Universe to run the program testone with the argument some_argument in the directory /z/Proj/annis/condor. I want a log, so that goes into /z/Proj/annis/condor/log. Be sure to specify a log file in a directory that you can write to.

Since Condor programs cannot talk to your terminal, the Output and Error fields are files where you want the output to go. If you had a program that wanted data typed in, you could put that in a file and specify the option Input. The last option Queue says to use the above information to add the job to the Condor job queue.
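As a minimal sketch of the Input option mentioned above, the following shell function prints a submit description file in which the program reads its standard input from a file. The function name write_submit_with_input and the input file name are made up for illustration; the other settings simply repeat the example above.

```shell
# Print a submit description file whose job reads its standard input
# from a file. write_submit_with_input and the file name passed to it
# are hypothetical; the remaining settings repeat the example above.
write_submit_with_input() {
    input_file=$1
    cat <<EOF
Universe   = vanilla
Executable = testone
Initialdir = /z/Proj/annis/condor
Log        = /z/Proj/annis/condor/log

# The program reads this file as if it had been typed at the terminal.
Input      = $input_file
Output     = out
Error      = err

Queue
EOF
}

write_submit_with_input in
```

Redirecting the output to a file, for example with write_submit_with_input in > submit_description_file, gives you something you can hand to condor_submit.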

Most people will be able to submit Condor jobs from their desktop Linux machines. Log into a local compute host if that is not the case for you.

See below for more information on setting up a submit description file to run more than one job.

Running R in the Vanilla Universe

Here’s a trivial submit file for running R from Condor. Note in particular the --vanilla argument. It tells R to skip its startup files and not to restore or save a workspace, so the run depends only on the commands in the input file.

# Condor submission file for R.

Universe = vanilla
Executable = /s/bin/R
initialdir = /z/Proj/annis/condor
log = /z/Proj/annis/condor/log

arguments  = --vanilla
input      = R_input
output     = R_output
error      = err.$(Process)

Queue

Submitting a Condor job

If the example submit file above is called submit_description_file you would submit the job to Condor with this command:

$ condor_submit submit_description_file
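Since a submit file without a Queue line or a Log setting is an easy mistake to make, a small pre-flight check can help. The sketch below is only an illustration: check_submit_file is a hypothetical helper, not a Condor command, and it checks only for those two settings.

```shell
# check_submit_file is a hypothetical helper, not part of Condor.
# It succeeds only if the file sets Log and contains a Queue line.
check_submit_file() {
    file=$1
    # The examples in this document spell keywords both as "Log" and
    # "log", so the match ignores case (grep -i).
    grep -Eiq '^[[:space:]]*log[[:space:]]*=' "$file" || {
        echo "$file: no Log setting" >&2
        return 1
    }
    grep -Eiq '^[[:space:]]*queue([[:space:]]|$)' "$file" || {
        echo "$file: no Queue line" >&2
        return 1
    }
}
```

You could then run check_submit_file submit_description_file && condor_submit submit_description_file so that the job is only submitted when the check passes.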

Command/Task Summary

You can find a copy of the entire Condor manual, including command manuals, here. Section 2, the User’s Manual, will be of most use to researchers.

For all of these make sure you’re on a machine in the Condor pool and have fixed your path as described above. Also, all of the Condor commands listed below will take the command line argument -h for “help” which will give a list of all options available.

Submitting Jobs

Use the command

$ condor_submit submit_description_file

Listing Jobs

Once you have submitted jobs to the Condor pool you may get a list of all your jobs with the command condor_q -submitter $USER. So, I would use this:

nova-4 /z/Proj/annis/condor $ condor_q -submitter annis

-- Submitter: :  : nova-4
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   5.0   annis           4/3  12:44   0+21:19:11 R  0   6.6  hurt one          
   5.1   annis           4/3  12:44   0+20:37:32 R  0   6.7  hurt two          
   5.2   annis           4/3  12:44   0+19:00:15 R  0   6.8  hurt three        
   5.3   annis           4/3  12:44   0+21:24:12 R  0   7.2  hurt four         
   5.4   annis           4/3  12:44   0+19:28:25 I  0   7.2  hurt five         
   5.5   annis           4/3  12:44   0+21:06:18 R  0   6.7  hurt six          
   5.6   annis           4/3  12:44   0+19:58:39 R  0   7.2  hurt seven        
   5.7   annis           4/3  12:44   0+20:59:28 R  0   6.6  hurt eight        

8 jobs; 1 idle, 7 running, 0 held

Notice that this will tell you when the job was submitted, how much run time it has accumulated, how big it is, etc. The ST column indicates if the job is Running or Idle.

Finally, you can get a listing of another user’s jobs by putting their Unix login name after the -submitter option, or you may run condor_q -global for a listing of all jobs Condor is currently working on.
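If you have many jobs, a quick tally of the ST column can be easier to read than the full listing. The following is a sketch only: count_job_states is a hypothetical helper, and it assumes the column layout shown in the listing above, where ST is the sixth field of each job line.

```shell
# count_job_states is a hypothetical helper. It reads condor_q output on
# standard input and prints how many jobs are in each ST state
# (R = running, I = idle). It assumes ST is the sixth field, as in the
# listing above.
count_job_states() {
    # Only lines whose first field looks like cluster.process are jobs.
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

A typical use would be condor_q -submitter $USER | count_job_states, which prints one line per state followed by the number of jobs in it.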

Removing Jobs

In the output above you’ll notice that each job has a numeric ID, where the part before the decimal point is the cluster and the part after is the process. So, every job which was started by the same call to condor_submit will have the same cluster number.

The command condor_rm will allow you to remove both single jobs and entire clusters. If I wanted to remove just the idle job from above, I’d simply run condor_rm 5.4 and that job would be removed. If I wanted to remove the entire cluster, I simply do this: condor_rm 5
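In a script, the cluster and process parts of a job ID can be separated with ordinary shell parameter expansion. This is a small sketch; the variable names are purely illustrative.

```shell
# Split a job ID such as 5.4 into its cluster and process parts
# using plain POSIX parameter expansion. Variable names are illustrative.
job_id=5.4
cluster=${job_id%%.*}   # everything before the first dot
process=${job_id#*.}    # everything after the first dot
echo "cluster=$cluster process=$process"
```

With the cluster in hand, condor_rm "$cluster" would remove every job from that submission, as described above.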

Condor Status

To get a list of all the machines available to the Condor pool and their states, use condor_status:

nova-4 /z/Proj/annis/condor $ condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
              SOLARIS26   SUN4u  Claimed    Busy       0.998   512  0+01:57:00
              SOLARIS26   SUN4u  Owner      Idle       0.072   512  0+16:13:34
caph.biostat. SOLARIS26   SUN4u  Claimed    Busy       2.004   128  0+21:01:39
kaid.biostat. SOLARIS26   SUN4u  Owner      Idle       0.074   192  0+02:05:30
              SOLARIS26   SUN4u  Owner      Idle       0.027   320  0+21:40:10
              SOLARIS26   SUN4u  Claimed    Busy       0.000   320  0+00:00:02
              SOLARIS26   SUN4u  Claimed    Busy       0.782   128  0+00:00:04
              SOLARIS26   SUN4u  Owner      Idle       1.003   128  0+00:00:05
nova-3.biosta SOLARIS26   SUN4u  Claimed    Busy       1.016   256  0+07:43:02
nova-4        SOLARIS26   SUN4u  Claimed    Busy       1.008   256  0+00:16:12
              SOLARIS26   SUN4u  Owner      Idle       1.012   768  0+00:01:11
              SOLARIS26   SUN4u  Owner      Idle       1.000   768  4+00:34:29
polaris.biost SOLARIS26   SUN4u  Unclaimed  Idle       0.000   128  0+00:10:04
procyon.biost SOLARIS26   SUN4u  Owner      Idle       0.000  1152  0+01:55:44

                     Machines Owner Claimed Unclaimed Matched Preempting

     SUN4u/SOLARIS26       14     7       6         1       0          0

               Total       14     7       6         1       0          0

This gives quite a lot of information. The most interesting field is State. The state can be “Claimed”, which means Condor is using the machine; “Owner”, which means someone is logged into the machine and using it; or “Unclaimed”, which means no one is using it but there are no jobs Condor can run on it.
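To tally machines by State yourself, you can filter the condor_status listing. The sketch below makes a few assumptions: count_machine_states is a hypothetical helper, machine lines are recognised by an ActvtyTime value of the form days+hh:mm:ss in the last field, and State is taken as the fifth field counted from the right, so that a missing Name column does not shift it.

```shell
# count_machine_states is a hypothetical helper. It reads condor_status
# output on standard input and prints the number of machines per State.
# Assumptions: the last field of a machine line is an ActvtyTime such as
# 0+01:57:00, and State is the fifth field counted from the right.
count_machine_states() {
    awk 'NF >= 7 && $NF ~ /^[0-9]+[+]/ { n[$(NF-4)]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Used as condor_status | count_machine_states, it skips the header and the totals at the bottom because neither ends in an activity time.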

The HTCondor project maintains a page on the condor_q command on its site, which also contains a number of how-tos you may find helpful.

Submitting more than one job at a time

If the program testone described above has three data files called data_1, data_2 and data_3 which all need to be processed, you can simply expand the submit file to look like this:

# Run several tests in the Vanilla Universe.

# This is our Universe.
Universe = vanilla

# This is the name of the program you want Condor to run.
Executable = testone

# This is the directory where you want the command to run.
Initialdir = /z/Proj/annis/condor

# This is highly recommended, and will help the system group keep
# track of and find errors
Log = /z/Proj/annis/condor/log

# For each job: its argument, the file for its output and the file for
# its errors, followed by Queue to add that job to the Condor queue.
# Comments must be on lines of their own.
Arguments  = data_1
Output     = results_1
Error      = err.$(Process)
Queue

Arguments  = data_2
Output     = results_2
Error      = err.$(Process)
Queue

Arguments  = data_3
Output     = results_3
Error      = err.$(Process)
Queue

Notice that since Executable, Log, etc. don’t change, you don’t need to reset those for each copy of the program you wish to run. So, this file will submit three jobs to the Condor queue, each of which will be given a sequential Process number, which you can use in the configuration to name output and error files. In some cases you might want to change the Initialdir option for each job. If for some reason you wanted to run the exact same job many times, say for some randomized simulation, you could simply do this (assuming everything up to the Log line is the same):

Error = err.$(Process)
Output = results.$(Process)
Queue 50

In this case, the job will be run 50 times, with the results going into results.0, results.1, … results.49.
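When the list of data files is long, writing the per-job groups by hand gets tedious. As a sketch of one way to automate it, the shell function below prints a submit description file with one group per data file. The name gen_submit is hypothetical, the shared settings repeat the example above, and the rule for naming the result files (results_ plus whatever follows the last underscore in the data file name) is an assumption made for illustration.

```shell
# gen_submit is a hypothetical helper: it prints a submit description
# file with one Arguments/Output/Error/Queue group per data file named
# on the command line. The shared settings repeat the example above.
gen_submit() {
    cat <<'EOF'
Universe   = vanilla
Executable = testone
Initialdir = /z/Proj/annis/condor
Log        = /z/Proj/annis/condor/log
EOF
    for data in "$@"; do
        # data_1 -> results_1: keep whatever follows the last underscore.
        suffix=${data##*_}
        printf '\nArguments  = %s\n' "$data"
        printf 'Output     = results_%s\n' "$suffix"
        # Single quotes keep $(Process) literal so Condor expands it.
        printf 'Error      = err.$(Process)\n'
        printf 'Queue\n'
    done
}

gen_submit data_1 data_2 data_3
```

Saving the output with gen_submit data_1 data_2 data_3 > submit_description_file produces a file that queues three jobs.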

The submit description file for jobs in the Standard Universe is the same as for the Vanilla Universe, except of course that you must set the Universe option to Standard.

Limiting Jobs to Condor-only Hosts

If you want your Condor jobs to run only on machines that are dedicated to Condor (i.e., that no one logs into directly), add this to either the Requirements or Rank field of your submit file: BCG_PULSAR == TRUE. In Requirements it is a hard restriction; in Rank it only expresses a preference.

Condor and Multi-CPU and High-Memory Jobs

Condor works best if you tell it as much about your job as possible. In particular, if you are running multi-threaded or multi-processing jobs you must tell Condor that you need several CPUs. Otherwise Condor jobs will quickly make a bunch of compute servers unstable and probably unusable. The setting for that is request_cpus, as part of your regular condor submit file.

It is also helpful if you tell Condor how much memory your job will need, especially if it is quite large. The setting for that is request_memory, where the values can be things like 5G for five gigabytes, 300M for 300 megabytes, etc. To figure out how much memory you are using, start a single job interactively, run the command top and look at your memory usage. The value (in kilobytes) under the VIRT (for “virtual memory”) column, converted to megabytes or gigabytes, is the value to use in the request_memory setting.

So, imagine you need a multi-processing job with high memory requirements. Run it interactively and start top in another window, and watch the memory use. It might say, for example, that it’s using 384923 KB of virtual memory, which is roughly 385 megabytes. You know you need four CPUs. Rounding the memory up a little, you should add this to your submit file:

request_memory = 390M
request_cpus = 4

Arguments = ...
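The conversion from the VIRT figure to a request_memory value is simple arithmetic, sketched below. The helper name kb_to_request_memory is hypothetical, and the rounding rule is an assumption: 1 MB is treated as 1000 KB and the result is rounded up to the next multiple of 10, so the request errs on the generous side.

```shell
# kb_to_request_memory is a hypothetical helper. It turns the VIRT value
# from top (in kilobytes) into a request_memory value in megabytes.
# Assumption: 1 MB is treated as 1000 KB and the result is rounded up
# to the next multiple of 10, so the request is slightly generous.
kb_to_request_memory() {
    kb=$1
    mb=$(( (kb + 999) / 1000 ))     # round up to whole megabytes
    mb=$(( (mb + 9) / 10 * 10 ))    # round up to a multiple of 10
    echo "${mb}M"
}

kb_to_request_memory 384923   # prints 390M
```

Asking for far more memory than the job needs can leave it waiting for a machine that has that much free, so modest rounding is usually enough.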

If you have any questions about tailoring your job to work well with condor, please do send us a request and we can work with you to help you get the most out of Condor.

Some details on how Condor works

The Condor system has daemons running on every machine in the Condor pool. Those daemons keep track of how a machine is being used, and announce to the master Condor scheduler whether they are available to run jobs. If someone is logged into their machine and actively using it (typing, moving the mouse, running compute jobs) then the machine will announce that it currently has an owner and cannot do other work. If on the other hand no one is logged in, or if no one has touched a mouse or keyboard for a while, the machine will tell the master scheduler that it is available to handle jobs. If the master has jobs waiting, it will send a job to the idle machine.

If after accepting a job from the master scheduler someone logs into the machine, Condor will suspend the Condor job and will wait about 10 minutes. If after 10 minutes the machine is still being used, the job will be taken off the machine. In the Standard Universe, Condor can actually move a running program to another machine and continue computation where it left off. This is called checkpointing. In the Vanilla Universe, the job is killed, and the master scheduler will try to start the program somewhere else, but in that case it must start over from the beginning.

If you’re the person who logs into a machine where a Condor job is running and decide to run top you’ll see something like this:

last pid: 23470;  load averages:  1.11,  0.83,  0.82                   14:16:49
32 processes:  30 sleeping, 1 running, 1 on cpu
CPU states:     % idle,     % user,     % kernel,     % iowait,     % swap
Memory: 256M real, 98M free, 12M swap in use, 238M swap free

21463 annis      1 -10    5 2784K 1768K stop    78:49 89.67% condor_exec.5.5
23447 root       1  33    0 1968K 1720K sleep   0:01  4.39% sshd
23470 annis      1   0    0 1512K 1352K cpu     0:00  1.26% top
23450 annis      1  20    0 1624K 1224K sleep   0:00  0.58% ksh
19613 root       1  33    0 4200K 3088K sleep   4:50  0.08% condor_startd
19612 root       1  33    0 4048K 2608K sleep   2:51  0.01% condor_master

Notice that although top still shows the Condor job as using the most CPU, the load will start going down and will drop to normal in about five minutes. Also notice that the STATE field shows the Condor job as stopped. It has accumulated a lot of CPU time, but it will not use any more while you are using the machine: it resumes if you log off, and otherwise it is moved off the machine after about 10 minutes.

Finally, Condor does a good job of making fair use of resources. If only a few machines are free to accept jobs, and you submit 5 jobs, and some other user submits, say, 70, your jobs will run at a much higher priority than the other user’s jobs. So, the more jobs you have in the Condor queue, the lower the priority of all of them. This prevents situations where someone submits 100s of jobs on Monday and locks out everyone else for the week.