When running computationally intensive jobs, either directly or with HTCondor, it is important to understand just what your job needs for memory and CPU. The sections below describe different tools to understand what your computational jobs are doing.
What is the computer doing now?
A quick way to find out what a machine is doing right now is w.
nebula-1 ~ $ w
09:26:45 up 6 days, 1:25, 3 users, load average: 15.85, 15.73, 15.74
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
annis pts/1 fornax Wed12 5.00s 1.27s 0.06s w
zzzzzzzz pts/2 adhara Wed17 5days 3.65s 3.13s ssh nebula-2
yyyyyyy pts/4 adhara 09:22 2:45 0.13s 0.13s -bash
Notice the numbers at the end of the first line. They tell you how busy the machine is (this is usually called the “load”). The first number is the 1-minute average, the second the 5-minute, and the third the 15-minute. The load counts, roughly, how many processes are running or waiting to run, so it should be read relative to the number of CPU cores: a machine with a load above 1.0 is doing at least some work, and one with a load of 15.85 is doing a good deal more.
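The load is easiest to interpret once you know how many cores the machine has. A quick way to check both on a Linux machine (a minimal sketch using the standard nproc command and the /proc/loadavg file):

```shell
# Number of CPU cores on this machine.
nproc

# The same 1, 5, and 15 minute load averages that w reports.
cat /proc/loadavg
```

If the 1-minute load is well above the number nproc reports, the machine is already saturated.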
I often run w by habit immediately after I log into a new machine, just to get a quick look at how busy it is. I recommend the same practice to anyone looking to run computationally intensive jobs.
The other main command to see what a machine is up to is htop (an improved version of top). When you run it you get a continuously updating report of exactly what the machine is doing. Press the Q key to exit.
Here is an example of the top part of an htop screen, for a machine that has 32 CPUs:
This shows what every CPU core on the machine is doing. A core at 100% is completely busy. Notice in the lower right that the 1, 5, and 15 minute load numbers are given.
Here’s a different view on a different machine, this one with only 4 CPUs:
Notice that while the load isn’t very high compared to the number of CPUs available, a lot of memory is in use, as represented in the Mem and Swp lines. It is almost always a bad sign if there is a lot of swap (Swp) memory in use, which is why htop highlights that in red.
Under that summary of CPU and memory use is a listing of all the programs running on the computer. A lot of this information relates to underlying operating system functions; anything running as the user “root”, for example, is just an OS service. Here’s an example showing only a few root tasks and my own htop process.
(You can run “htop -u $USERNAME” where $USERNAME is your unix login name, to show only your own programs.)
The columns that are most important for understanding your job are VIRT, RES, and CPU%. VIRT and RES show memory usage: VIRT is the total virtual memory the process has mapped, and RES is the resident (physical) memory it is actually using. For HTCondor I recommend paying attention to VIRT, and using that to determine your request_memory setting.
The CPU% column shows how much of a single CPU the process is using. A single-process, single-threaded job should only ever get up to 100%. If your program is doing multi-threading or some other sort of parallel processing it might well show, say, 376%. That means it’s using nearly 4 CPU cores to do its work. In this situation, you’ll want to tell HTCondor, with request_cpus, that this job needs 4 CPUs.
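If you just want a one-line snapshot of a single process rather than the full htop display, ps can report the same quantities. A sketch: here it inspects the shell’s own PID ($$), which you would replace with your job’s PID.

```shell
# PID, CPU%, memory%, virtual size (like VIRT), resident size (like RES),
# and command name. $$ is this shell's own PID; substitute your job's PID.
ps -o pid,pcpu,pmem,vsz,rss,comm -p $$
```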
For example, the bowtie tool has a command line option, --threads, where you can select how many CPUs to use. If you give it “--threads 4” then you should expect the CPU% to be near 400%, and you’ll need to tell HTCondor that you need 4 CPUs.
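Putting it together, the relevant lines of an HTCondor submit file might look like this. This is a sketch: the executable name and memory figure are made up, and request_memory should come from what you actually measured with htop.

```
# Hypothetical submit file fragment; names and sizes are placeholders.
executable     = run_bowtie.sh
request_cpus   = 4
request_memory = 8GB
queue
```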
Preparing for HTCondor
If you’re getting a job ready to run on HTCondor, it’s good to run it first on the command line and use htop to profile it, to find out how much memory it uses and how many CPUs. Without proper profiling, you can end up in a situation where your HTCondor jobs fail unexpectedly due to resource contention, or take much longer to complete than necessary.
Rules of Thumb
In general, if htop shows most CPUs busy, or shows a lot of swap memory in use, that machine isn’t really ready to do a lot of new work. If you start a bunch of local, high-compute jobs on a busy machine you’ll only slow down everyone else’s jobs, without getting your own work done any more quickly. It’s best to find a less busy machine, or consider using HTCondor to find spare CPU cycles for you.