Running Condor Jobs
Condor is a batch processing system for Unix designed to seek out machines which are idle and scavenge those wasted CPU cycles. This allows desktop workstations which are commonly unused in the evening and on weekends and holidays to be matched with people who have a need for compute power for batch jobs. Of course, dedicated compute servers are useful members of a Condor pool, too. Condor is configured so that interactive use of a computer has precedence over Condor batch jobs. If Condor is using your (otherwise idle) desktop machine and you start typing, Condor will immediately stop using your machine. See below for more details on how Condor works.
There are three steps to using Condor:
- Prepare your batch job for Condor
- Prepare your submit description file
- Submit the job
Finally, in some circumstances you'll need to make more sophisticated changes to how your condor jobs run. In particular, we may ask you to throttle back when you submit 100s or 1000s of jobs.
Preparing a Condor submission
Before submitting a batch job to Condor, you have to make sure you actually need Condor capabilities. For example, interactive, graphical programs will not run with Condor. Or, if you only have a single job which will take several days to run, you might as well run it in the background on your own machine by hand:
$ nohup do_a_lot_of_work > work_log &
In particular, very long-running jobs — more than about three days — which run in the Vanilla Universe (see next section) may never actually complete under Condor, and should be run by hand. If, on the other hand, you have to run the same program on multiple data sets, or if you can partition the problem into several independent programs, then you can profit from using Condor.
Vanilla vs. Standard Universes
The Biostatistics Condor pool is configured to offer three sorts of batch "universes": the Standard Universe, the Vanilla Universe and the Java Universe. It is important to know which universe you're going to use before preparing your batch submission file.
The Vanilla Universe can handle any batch job (Splus, R, SAS, etc) and is what most users will use. The Standard Universe should be used for C, C++ and FORTRAN because it has the most potential to speed up jobs. Checkpointing, the ability to migrate jobs to less busy machines, is only available in the Standard Universe. If you wish to use the Standard Universe you will need to recompile your C or FORTRAN program with special Condor libraries. In general, this is actually quite straightforward.
To recompile a program to work with Condor, you simply change your compile line to start with the command condor_compile:
# Instead of this:
$ gcc -o harmonics harmonics.c -lgmp
# Use this:
$ condor_compile gcc -o harmonics harmonics.c -lgmp
This will handle all the necessary Condor library linking. You need to log on to the Condor system for condor_compile to work, and you cannot run a program compiled this way outside Condor.
Creating a submit description file
The submit description file is the file you use to tell Condor what you want it to do. I cover only the basics here, so you may wish to see the submission section of the user's manual for the gritty details. Note that in a submit description file, comments are indicated with a pound sign (#). Also, these files contain the line:
Universe = vanilla
To submit a Standard Universe job, change the line to:
Universe = standard
Here is a single job submission file:
# Run a code test in the Vanilla Universe.

# This is our Universe.
Universe = vanilla

# This is the name of the program you want Condor to run.
Executable = testone

# This is the directory where you want the command to run.
Initialdir = /z/Proj/annis/condor

# This is highly recommended, and will help the system group keep
# track of and find errors
Log = /z/Proj/annis/condor/log

Arguments = some_argument
# All output goes into this file.
Output = out
# All errors go into this file.
Error = err
Queue
This should be considered the simplest useful submit description file. It says that we're using the Vanilla Universe to run the program testone with the argument some_argument in the directory /z/Proj/annis/condor. I want a log, so that goes into /z/Proj/annis/condor/log. Be sure to specify a log file in a directory that you can write to.
Since Condor programs cannot talk to your terminal, the Output and Error fields are files where you want the output to go. If you had a program that wanted data typed in, you could put that in a file and specify the option Input. The last option Queue says to use the above information to add the job to the Condor job queue.
Most people will be able to submit Condor jobs from their desktop Linux machines. If that is not the case for you, log into a local compute host.
See below for more information on setting up a submit description file to run more than one job.
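The advice above about keeping the log in a directory you can write to can be checked before you submit. The following is a minimal sketch, not part of Condor itself: log_dir_writable is a hypothetical helper name, and it assumes the Log line gives an absolute path, as in the example file.

```shell
#!/bin/sh
# Hypothetical helper: report whether the directory holding the Log file
# named in a submit description file exists and is writable by you.
# Assumes the Log value is an absolute path.
log_dir_writable() {
    submit_file=$1
    # Take the value of the first "Log = ..." line (key is case-insensitive).
    log_path=$(sed -n '/^[[:space:]]*[Ll][Oo][Gg][[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*$//;p;}' "$submit_file" | head -n 1)
    if [ -z "$log_path" ]; then
        echo "no Log line found in $submit_file" >&2
        return 2
    fi
    log_dir=$(dirname "$log_path")
    if [ -d "$log_dir" ] && [ -w "$log_dir" ]; then
        return 0
    fi
    echo "cannot write to $log_dir" >&2
    return 1
}
```

You might run it as log_dir_writable submit_description_file && condor_submit submit_description_file, so that the job is only submitted when the check passes.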
Running R in the Vanilla Universe
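Here's a trivial submit file for running R from Condor. Note in particular the --vanilla argument. Despite the name, it has nothing to do with the Vanilla Universe: it tells R not to read startup files and not to save or restore a workspace, which is what you want for a non-interactive batch run.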
# Condor submission file for R.
Universe = vanilla
Executable = /s/bin/R
initialdir = /z/Proj/annis/condor
log = /z/Proj/annis/condor/log
arguments = --vanilla
input = R_input
output = R_output
error = err.$(Process)
Queue
Submitting a Condor job
If the example submit file above is called submit_description_file you would submit the jobs to Condor with this command:
$ condor_submit submit_description_file
The complete Condor manual, including the manual pages for each command, is available online. Section 2, the User's Manual, will be of most use to researchers.
For all of the commands below, make sure you're on a machine in the Condor pool and that the Condor commands are on your path. Also, all of the Condor commands listed below will take the command line argument -h for "help" which will give a list of all options available.
To submit a job, use the command:
$ condor_submit submit_description_file
Once you have submitted jobs to the Condor pool you may get a list of all your jobs with the command condor_q -submitter $USER. So, I would use this:
nova-4 /z/Proj/annis/condor $ condor_q -submitter annis

-- Submitter: email@example.com : : nova-4
 ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
 5.0     annis     4/3  12:44   0+21:19:11 R  0   6.6  hurt one
 5.1     annis     4/3  12:44   0+20:37:32 R  0   6.7  hurt two
 5.2     annis     4/3  12:44   0+19:00:15 R  0   6.8  hurt three
 5.3     annis     4/3  12:44   0+21:24:12 R  0   7.2  hurt four
 5.4     annis     4/3  12:44   0+19:28:25 I  0   7.2  hurt five
 5.5     annis     4/3  12:44   0+21:06:18 R  0   6.7  hurt six
 5.6     annis     4/3  12:44   0+19:58:39 R  0   7.2  hurt seven
 5.7     annis     4/3  12:44   0+20:59:28 R  0   6.6  hurt eight

8 jobs; 1 idle, 7 running, 0 held
Notice that this will tell you when the job was submitted, how much run time it has accumulated, how big it is, etc. The ST column indicates if the job is Running or Idle.
Finally, you can get a listing of another user's jobs by putting their Unix login name after the -submitter option, or you may run condor_q -global for a listing of all jobs Condor is currently working on.
In the output above you'll notice that each job has a numeric ID, where the part before the decimal point is the cluster and the part after is the process. So, every job which was started by the same call to condor_submit will have the same cluster number.
The command condor_rm will allow you to remove both single jobs and entire clusters. If I wanted to remove just the idle job from above, I'd simply run condor_rm 5.4 and that job would be removed. If I wanted to remove the entire cluster, I simply do this: condor_rm 5
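If you handle job IDs in a script, the cluster.process split described above can be done with ordinary shell parameter expansion. This is only a sketch; job_cluster and job_process are hypothetical helper names, not Condor commands.

```shell
#!/bin/sh
# Hypothetical helpers: split a Condor job ID such as 5.4 into its parts.
job_cluster() {
    # Drop everything from the first dot onward, leaving the cluster.
    echo "${1%%.*}"
}

job_process() {
    # Drop everything up to and including the first dot, leaving the process.
    echo "${1#*.}"
}
```

For example, condor_rm "$(job_cluster 5.4)" would remove every job in the cluster that job 5.4 belongs to.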
To get a list of all the machines available to the Condor pool and their states, use condor_status:
nova-4 /z/Proj/annis/condor $ condor_status

Name                           OpSys     Arch  State     Activity LoadAv Mem  ActvtyTime
firstname.lastname@example.org SOLARIS26 SUN4u Claimed   Busy     0.998   512 0+01:57:00
email@example.com              SOLARIS26 SUN4u Owner     Idle     0.072   512 0+16:13:34
caph.biostat.                  SOLARIS26 SUN4u Claimed   Busy     2.004   128 0+21:01:39
kaid.biostat.                  SOLARIS26 SUN4u Owner     Idle     0.074   192 0+02:05:30
firstname.lastname@example.org SOLARIS26 SUN4u Owner     Idle     0.027   320 0+21:40:10
email@example.com              SOLARIS26 SUN4u Claimed   Busy     0.000   320 0+00:00:02
firstname.lastname@example.org SOLARIS26 SUN4u Claimed   Busy     0.782   128 0+00:00:04
email@example.com              SOLARIS26 SUN4u Owner     Idle     1.003   128 0+00:00:05
nova-3.biosta                  SOLARIS26 SUN4u Claimed   Busy     1.016   256 0+07:43:02
nova-4                         SOLARIS26 SUN4u Claimed   Busy     1.008   256 0+00:16:12
firstname.lastname@example.org SOLARIS26 SUN4u Owner     Idle     1.012   768 0+00:01:11
email@example.com              SOLARIS26 SUN4u Owner     Idle     1.000   768 4+00:34:29
polaris.biost                  SOLARIS26 SUN4u Unclaimed Idle     0.000   128 0+00:10:04
procyon.biost                  SOLARIS26 SUN4u Owner     Idle     0.000  1152 0+01:55:44

                 Machines Owner Claimed Unclaimed Matched Preempting
SUN4u/SOLARIS26        14     7       6         1       0          0
          Total        14     7       6         1       0          0
This gives quite a lot of information. The most interesting field is State. The state can be "Claimed" which means Condor is using the machine, "Owner" which means someone is logged into the machine and using it and "Unclaimed" which means no one is using it, but there are no jobs Condor can run on it.
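If you only want the State tally for the pool, you can count that column yourself. The sketch below assumes the eight-column layout shown above, with State in the fourth column; other Condor versions may print different columns. count_states is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read condor_status output on stdin and print how
# many machines are in each State, one "State count" pair per line.
# Assumes the eight-column layout shown in this document.
count_states() {
    # Only machine rows have eight fields with a known State in field 4;
    # the column header and the summary lines at the bottom are skipped.
    awk 'NF == 8 && $4 ~ /^(Owner|Claimed|Unclaimed|Matched|Preempting)$/ {
             n[$4]++
         }
         END { for (s in n) print s, n[s] }' | sort
}
```

You would use it as condor_status | count_states.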
Submitting more than one job at a time
If the program testone described above has three data files called data_1, data_2 and data_3 which all need to be processed, you can simply expand the submit file to look like this:
# Run several tests in the Vanilla Universe.

# This is our Universe.
Universe = vanilla

# This is the name of the program you want Condor to run.
Executable = testone

# This is the directory where you want the command to run.
Initialdir = /z/Proj/annis/condor

# This is highly recommended, and will help the system group keep
# track of and find errors
Log = /z/Proj/annis/condor/log

Arguments = data_1
# All output goes into this file.
Output = results_1
# All errors go into this file.
Error = err.$(Process)
Queue

Arguments = data_2
# All output goes into this file.
Output = results_2
# All errors go into this file.
Error = err.$(Process)
Queue

Arguments = data_3
# All output goes into this file.
Output = results_3
# All errors go into this file.
Error = err.$(Process)
Queue
Notice that since Executable, Log, etc. don't change, you don't need to reset those for each copy of the program you wish to run. So, this file will submit three jobs to the Condor queue, each of which will be given a sequential Process number, which you can use in the configuration to name output and error files. In some cases you might want to change the Initialdir option for each job. If for some reason you wanted to run the exact same job many times, say for some randomized simulation, you could simply do this (assuming everything up to the Log line is the same):
Error = err.$(Process)
Output = results.$(Process)
Queue 50
In this case, the job will be run 50 times, with the results going into results.0, results.1, ... results.49.
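When the list of data files gets long, writing each stanza by hand becomes tedious. One possible approach is to generate the submit file with a short shell function. This is a sketch only: make_submit is a hypothetical helper name, the executable and paths are the ones from the example above, and it assumes your data files are named data_SOMETHING.

```shell
#!/bin/sh
# Sketch: print a Vanilla Universe submit description with one Queue
# statement per data file given on the command line.
make_submit() {
    # Settings shared by every job come first, exactly once.
    # The quoted 'EOF' keeps $(Process) literal for Condor to expand.
    cat <<'EOF'
Universe = vanilla
Executable = testone
Initialdir = /z/Proj/annis/condor
Log = /z/Proj/annis/condor/log
Error = err.$(Process)
EOF
    # Then one Arguments/Output/Queue stanza per data file.
    for data in "$@"; do
        # Turn data_1 into the suffix 1 for the results file name.
        suffix=${data#data_}
        printf '\nArguments = %s\nOutput = results_%s\nQueue\n' "$data" "$suffix"
    done
}
```

Running make_submit data_1 data_2 data_3 > submit_description_file would produce a file equivalent to the three-job example, which you can read over before passing it to condor_submit.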
The submit description file for jobs in the Standard Universe is the same as for the Vanilla Universe, except of course that you must set the Universe option to Standard.
Limiting Jobs to Condor-only Hosts
If you want your Condor jobs to run only on machines that are dedicated to Condor (i.e., that no one logs into directly), add this to the Requirements field of your submit file: BCG_PULSAR == TRUE. Putting the same expression in the Rank field instead makes those machines preferred rather than required.
Condor and Multi-CPU and High-Memory Jobs
Condor works best if you tell it as much about your job as possible. In particular, if you are running multi-threaded or multi-processing jobs you must tell Condor that you need several CPUs. Otherwise Condor jobs will quickly make a bunch of compute servers unstable and probably unusable. The setting for that is request_cpus, as part of your regular condor submit file.
It is also helpful if you tell Condor how much memory your job will need, especially if it is quite large. The setting for that is request_memory, where the values can be things like 5G for five gigabytes, 300M for 300 megabytes, etc. To figure out how much memory you are using, start a single job interactively, run the command top, and look at your memory usage. The value under the VIRT (for "virtual memory") column, which top reports in kilobytes, is the one to base the request_memory setting on, after converting it to megabytes or gigabytes.
So, imagine you need a multi-processing job with high memory requirements. Run it interactively, start top in another window, and watch the memory use. It might say, for example, that it's using 384923 KB of virtual memory, which is about 376 megabytes. You know you need four CPUs. Rounding the memory up a little, you should add this to your submit file:
request_memory = 400M
request_cpus = 4
Arguments = ...
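The conversion from top's kilobyte figure to a request_memory value is simple arithmetic, sketched below. kb_to_request_memory is a hypothetical helper name, and the sketch treats 1M as 1024 KB, which is an assumption about how the pool reads the M suffix. It rounds up to the next whole megabyte and adds no extra headroom, so you may still want to ask for a bit more.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in kilobytes (as shown under VIRT
# in top) to a request_memory value in whole megabytes, rounding up.
# Assumes 1M means 1024 KB.
kb_to_request_memory() {
    kb=$1
    # Adding 1023 before the integer division rounds up instead of down.
    mb=$(( (kb + 1023) / 1024 ))
    echo "${mb}M"
}
```

For the 384923 KB example, kb_to_request_memory 384923 gives the smallest whole-megabyte request that covers the job.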
If you have any questions about tailoring your job to work well with condor, please do send us a request and we can work with you to help you get the most out of Condor.
Some details on how Condor works
The Condor system has daemons running on every machine in the Condor pool. Those daemons keep track of how a machine is being used, and announce to the master Condor scheduler whether they are available to run jobs. If someone is logged into their machine and actively using it (typing, moving the mouse, running compute jobs) then the machine will announce that it currently has an owner and cannot do other work. If on the other hand no one is logged in, or if no one has touched a mouse or keyboard for a while, the machine will tell the master scheduler that it is available to handle jobs. If the master has jobs waiting, it will send a job to the idle machine.
If someone logs into the machine after it has accepted a job from the master scheduler, Condor will suspend the Condor job and wait about 10 minutes. If after 10 minutes the machine is still being used, the job will be taken off the machine. In the Standard Universe, Condor can actually move a running program to another machine and continue computation where it left off. This is called checkpointing. In the Vanilla Universe, the job is killed, and the master scheduler will try to start the program somewhere else, but in that case it must start over from the beginning.
If you're the person who logs into a machine where a Condor job is running and decide to run top you'll see something like this:
last pid: 23470;  load averages:  1.11,  0.83,  0.82    14:16:49
32 processes:  30 sleeping, 1 running, 1 on cpu
CPU states:     % idle,     % user,     % kernel,     % iowait,     % swap
Memory: 256M real, 98M free, 12M swap in use, 238M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
21463 annis      1 -10    5 2784K 1768K stop   78:49 89.67% condor_exec.5.5
23447 root       1  33    0 1968K 1720K sleep   0:01  4.39% sshd
23470 annis      1   0    0 1512K 1352K cpu     0:00  1.26% top
23450 annis      1  20    0 1624K 1224K sleep   0:00  0.58% ksh
19613 root       1  33    0 4200K 3088K sleep   4:50  0.08% condor_startd
19612 root       1  33    0 4048K 2608K sleep   2:51  0.01% condor_master
Notice that although top still shows the Condor job as using the most CPU, the load will start going down, and will drop to normal in about five minutes. Also notice that the STATE field indicates that the Condor job is stopped. It has accumulated a lot of CPU time, but it will not use any more until you log off. If you stay logged in, it will be moved off the machine after about 10 minutes.
Finally, Condor does a good job of making fair use of resources. If only a few machines are free to accept jobs, and you submit 5 jobs, and some other user submits, say, 70, your jobs will run at a much higher priority than the other user's jobs. So, the more jobs you have in the Condor queue, the lower the priority of all of them. This prevents situations where someone submits 100s of jobs on Monday and locks out everyone else for the week.