Foreword¶
This tutorial is an introduction to using HTCondor on the Biochemistry Computational Cluster (BCC).
The BCC runs under Linux and therefore all examples are shown for this operating system. The command-line shell is bash.
As a general guide, marks are placed throughout the tutorial to indicate actions to be taken by the reader. The $ sign represents the shell prompt and should not be copied.
Typewriter-style text usually illustrates software output.
This is a very short description of the Biochemistry cluster and HTCondor.
Further information about creating job submission files can be found in the HTCondor online manual.
What is HTCondor?¶
HTCondor is a “scheduler” system that dispatches compute jobs to anywhere from one to a very large number of “compute nodes” that actually perform the calculations.
HTCondor is developed by the Computer Sciences Department at the University of Wisconsin-Madison. The HTCondor web page1 contains a long description. Here is the first, summary-like paragraph:
Abstract
HTCondor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, HTCondor provides:
- a job queueing mechanism,
- scheduling policy,
- priority scheme,
- resource monitoring, and
- resource management.
Users submit their serial or parallel jobs to HTCondor, HTCondor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
Warning
Note: Using HTCondor is the only approved method for performing high throughput computing on the BCC Linux cluster.
Jobs have to be ready to be processed by HTCondor as jobs cannot be interactive on the cluster.

Note: The HTCondor software was known as 'Condor' from 1988 until its name changed in 2012.
Cluster access overview¶
The Biochemistry Computational Cluster (BCC) is a High Throughput Computing (HTC) environment within the UW-Madison Biochemistry Department. HTC provides rapid, parallel computing on a large number of small jobs.
Note: this is different from High Performance Computing (HPC) which allows computation on large datasets in large memory footprints.
If you require HPC rather than HTC, or if you are not part of the Biochemistry department, you may obtain a free account at the “Center for High Throughput Computing (CHTC)” at http://chtc.cs.wisc.edu:
“Standard access to CHTC resources are provided to all UW-Madison researchers, free of charge.”
Text-based access¶
The BCC cluster must be accessed via secure shell (ssh) with a text-based Terminal from a local computer. There is NO GUI interface available. For example:
- Macintosh: /Applications/Terminal
- Linux: Terminal or Shell
- Windows: install free software, e.g. PuTTY or MobaXterm
Note: The "submit" node is the only computer accessible to users, and jobs will be passed on to the larger hardware portions of the cluster that are not accessible directly to users.
The text-based command shell used is bash.
No graphical interface (X11)¶
Important note: there is no graphical user interface (GUI) available in any form as the X11 graphical base is not installed on the operating system.
Therefore, the only mode of action is via text-based access as described above.
Note: The ssh modifier -Y would not allow a GUI either.
VPN access¶
Access from outside the Biochemistry department requires a VPN connection:
Tip
If connecting from outside the Biochemistry department it will be necessary to connect via a Virtual Private Network (VPN) to mimic local presence.
Please refer to the following resources to install and activate VPN. General University description: it.wisc.edu/services/wiscvpn
Login info¶
From a text-based terminal use ssh to login:
ssh myname@submit.biochem.wisc.edu
where myname is your NetID. Your NetID password will be required after you press return.
For more information about NetID see:
• Activating Your Account: https://kb.wisc.edu/page.php?id=1140
• Getting authorized for BCC access: email Jean-Yves Sgro: jsgro@wisc.edu
Login Splash Screen¶
After login you will see the following screen:
*******************************************************************************
* Welcome to the UW-Madison Biochemistry Computational Cluster *
* *
* USE /scratch FOR JOB DATA! DO NOT STORE DATA IN YOUR USER FOLDER!!! *
* MOVE YOUR RESULTS TO OTHER STORAGE AFTER YOUR JOB COMPLETES, ALL DATA *
* MAY BE REMOVED BY ADMINISTRATORS AT ANY TIME!!! *
* *
* This computer system is for authorized use only. *
* *
* You must receive permission to access this system. Please contact the *
* Biochemistry Computational Research Facility for training and access. *
* https://bcrf.biochem.wisc.edu/bcc/ *
*******************************************************************************
Linux OS: CentOS¶
There are many Linux versions. On the BCC the version installed is called “CentOS” which is derived from “Red Hat Linux.”
The version installed can be obtained by displaying the system release file, for example:
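$ cat /etc/centos-release   # standard release file location on CentOS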
CentOS Linux release 7.9.2009 (Core)
The uname command can be used to obtain further information, with -a printing all of it:
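$ uname -a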
Linux biocwk-01093l.ad.wisc.edu 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Noteworthy: x86_64 means it is a 64-bit system, and el7 means “Enterprise Linux 7,” indicating that it is derived from the Enterprise Linux 7 release from the company Red Hat.
Name | Link |
---|---|
CentOS | https://www.centos.org |
redhat | https://www.redhat.com |
Before you begin¶
Using HTCondor requires knowledge of the Linux/Unix Shell bash command-line and information about how the cluster is set-up.
There are many preparation steps that will take time to organize. The first question to ask is “Why do I need to use the BCC, and therefore HTCondor?” and to validate for yourself that doing so would actually benefit your computation.
If you decide that HTCondor will be useful, you will then need to evaluate how the software you want to use can work on the BCC Linux cluster. It may also be required that you “compile” the software yourself. There is no pre-installed software other than the Linux OS itself.
Jobs running on BCC can take advantage of the /scratch directory as the local HTCondor is aware of its existence. However, that would not apply to jobs sent outside of the cluster.
Jobs have to be able to run in “batch” mode i.e. non-interactively. This means that you need to know what the software will require to run, such as input file(s) or ancillary data files.
Summary
Summary of what you need & need to know.
This will be reviewed further:
Username: Your UW-Madison NetID username and password.
Login to: submit.biochem.wisc.edu.
Cluster: Understand that /scratch is the main work place.
bash shell: Understand commands such as cd, ls, mkdir, etc.
Software: Understand all requirements of the software to run.
The BCC Linux cluster¶
Overview¶
The Biochemistry Computational Cluster (BCC) is a High Throughput Computing (HTC) environment within the UW-Madison Biochemistry Department.
The cluster can be described as a set of 10 computers connected to each other and sharing a common disk space allocation. As these are not really computers with a keyboard, a mouse and a screen they are typically referred to as “nodes.”
Only one node is accessible directly to users to submit jobs to the other 9 nodes. This node is called the “Submit Node” and also plays a role in job control. One could view the set-up as a simple hierarchy, with the Submit Node sitting between the users and the nine compute nodes.
Therefore all jobs and interaction have to go through the Submit node that will dispatch jobs, or job portions to other nodes. This means that the required calculations have to run in batch, non-interactive mode.
The Submit node controls 2 TB of disk space that is made available to and shared with the other nodes. Each compute node also has 240 GB of local space to use while performing calculations; this space is therefore not usable as storage space.
The compute nodes are also equipped with state-of-the-art graphics chips (GPUs) that can be specifically requested for calculations by software that is GPU-aware, which can greatly accelerate calculations. However, note that some software requires a specific GPU architecture. Since computations are not meant to occur on the submit node, that computer does not have a GPU.
The hardware specific data for the cluster is as follows:
Submit Node
- 2 x Intel Xeon E5-2650v2 8-Core 2.60 GHz (3.4GHz Turbo)
- Over 2TB of SSD based RAID 5 scratch disk space shared with each BCC computational node
- 128 GB DDR3 1866 ECC/REG Memory
- 10G Ethernet networking
9 x Dedicated Computation Node
- 2 x Intel Xeon E5-2680v2 10-Core 2.80 GHz (3.60GHz Turbo)
- 64 GB DDR3 1866 ECC/REG Memory
- 1 x NVIDIA Tesla K20M GPU
- 1 x 240 GB SSD
- 10G Ethernet networking
Connecting to the cluster¶
Only the “Submit Node” is accessible to users via a text-based terminal connection with the secure shell command:
ssh myname@submit.biochem.wisc.edu
where myname is your NetID; your password will be required after you press return. You need to be authorized to access BCC.
Disk space, home directory, /scratch¶
The default home directory is mapped according to the username and has a long default name reflecting how the NetID username is “mapped” to the BCC cluster.
If myname represents your NetID, the default $HOME directory will be mapped as shown below after a pwd command:
/home/myname@ad.wisc.edu
Tip
Important note: the default $HOME directory should NOT be used as the primary location for storage, or for HTCondor job submission, because HTCondor cannot "see" your home directory.
The main work area is called /scratch and should be used for all jobs. HTCondor is not aware of your home directory, but it is aware of the /scratch directory.
HTCondor nodes are set up in similar ways, and typically they all understand the shared disk space known as /scratch.
Warning
Each user should therefore create a working directory within /scratch and work from there rather than from the default $HOME directory.
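For example (a sketch; myname is a placeholder for your own directory name):
mkdir /scratch/myname
cd /scratch/myname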
File servers¶
The file server research.drive.wisc.edu is accessible from within the cluster as “mounted” volumes within /mnt/rdrive/labname, where labname is the name of the lab.
These mounted volumes are not visible to HTCondor.
See Moving files on and off the cluster.
Process¶
The fundamental process consists of submitting a “job file” that contains information on how to run the required software, together with any input and ancillary files.
Typically, one would create a shell script (*.sh) that runs the desired software, and then a submit script (*.sub) that submits the shell script to HTCondor, which schedules the job.
After the run has completed successfully, provisions within the *.sub file transfer all created files back to, e.g., /scratch.
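As a minimal sketch (hypothetical file names; the submit commands used here are covered in more detail later in this document), a shell script hello.sh could be:
#!/bin/bash
# print a short message from the compute node
echo "Hello from the compute node"
and a matching submit file hello.sub could be:
executable = hello.sh
output = hello.out
error = hello.err
log = hello.log
queue
The job would then be submitted with condor_submit hello.sub.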
Getting ready¶
To get ready, you need to evaluate what was just mentioned in the Process paragraph above, working backwards:
HTCondor file transfers¶
Part of the HTCondor method is to transfer files (sometimes all files and software binaries) to a temporary directory, run the job, and copy the output files back to your permanent working directory (e.g. on /scratch) upon completion.
HTCondor runs on a temporary directory that changes every time. For example, this directory could be called:
TMPDIR=/var/lib/condor/execute/dir_30471
This directory and all its files will disappear once the job is done. For the next job the directory would have a different name.
When the job starts, all the necessary files (e.g. hello.sh, hello.sub) are transferred to the temporary directory. When the job is done, output files created by the job are transferred back to the originating (or specified) directory.
Beyond local cluster: Flocking¶
When jobs are big, it may be useful to access computers that are beyond that of the Biochemistry cluster itself.
This can be done safely even with proprietary software as files are invisible to others and cannot be copied.
Sending files outside of BCC is a special case called “Flocking” and it may be necessary to adjust the submit command file to either be more “generic” or provide more details of files to be transferred, for example files that are not standard on other systems.
In particular, the /scratch directory would not be visible on these other computers and therefore cannot be used.
See also the Center for High Throughput Computing (CHTC) at http://chtc.cs.wisc.edu
QuickStart¶
(see QuickStart section...)
Resources / Help¶
Now that you know how to log-in and run the simplest job, here are resources to go further and learn how to use HTCondor with your own software.
On-line resources:¶
Resource | Link |
---|---|
HTCondor Quick Start Guide | http://research.cs.wisc.edu/htcondor/manual/quickstart.html |
Complete manual (*) | http://research.cs.wisc.edu/htcondor/manual/ |
(*) You can check which manual you need by checking which version of HTCondor is installed with the command condor_version (see above).
Tip
It is highly advised to get acquainted with some of this material before attempting any complex calculations on the cluster.
Info
For general HTCondor questions contact chtc@cs.wisc.edu.
For Biochemistry related questions contact jsgro@wisc.edu.
For general Biochem IT/network issues contact helpdesk@biochem.wisc.edu.
HTCondor concepts¶
We have already learned perhaps the most important command, which is the one used to submit a job: condor_submit.
The list of HTCondor commands is rather long: about 70. However, for most users and everyday use just a few are essential, for example to start, stop, hold, restart, and list submitted jobs.
Class Ads¶
Note
From the HTCondor manual:
Before you learn about how to submit a job, it is important to understand how HTCondor allocates resources.
ClassAds in HTCondor are comparable to classified ads in a newspaper. Sellers advertise what they sell; buyers may advertise what they wish to buy; both buyers and sellers have specifics and conditions.
Compute node ClassAds actively advertise lists of available attributes and resources, for example the type of CPU (Intel) and its speed, the memory (RAM) available, the operating system (Linux, Mac, Windows), etc.
Therefore, jobs can be submitted with either generic or very stringent requirements, via a specific syntax within the *.sub submit file.
For example, a user may require that the compute node be equipped with a graphical processing unit (GPU) and a minimum amount of RAM, for example 64 MB, with a preference for 256 MB if possible. Many other requirements can be added, depending on the software to be run by the job.
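A sketch of how such requests could look in a submit description file (the values are only illustrative; request_gpus, request_memory, and rank are standard submit commands, with memory expressed in MB):
request_gpus = 1
request_memory = 64
rank = Memory >= 256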
ClassAds advertised by both nodes and jobs are continuously read by HTCondor, which matches requests and verifies that all requirements from both ClassAds are met.
The HTCondor command condor_status provides a summary of the ClassAds in the pool:
Command | Description |
---|---|
condor_status -avail | shows only machines which are willing to run jobs now. |
condor_status -run | shows only machines which are currently running jobs for your username. |
condor_status -help | provides a list of many other options. |
condor_status -long | lists the machine ClassAds for all machines in the pool. The output is very long: about 100 ClassAds per compute node. |
Here are a few lines from a -long command, truncated on both ends. This command reports all information for all CPUs of all nodes and would output a total of 69,604 lines!
A few useful pieces of information, such as the memory or the Linux OS, are shown below:
[truncated above]
LastUpdate = 1527862877
LoadAvg = 1.0
Machine = "cluster-0001.biochem.wisc.edu"
MachineMaxVacateTime = 10 * 60
MachineResources = "Cpus Memory Disk Swap GPUs"
MAX_PREEMPT = ( 3 * 3600 )
MaxJobRetirementTime = MAX_PREEMPT * TARGET.OriginSchedd == "submit.biochem.wisc.edu"
Memory = 1024
Mips = 27186
MonitorSelfAge = 2159646
MonitorSelfCPUUsage = 1.720120371398834
MonitorSelfImageSize = 80268
MonitorSelfRegisteredSocketCount = 39
MonitorSelfResidentSetSize = 10100
MonitorSelfSecuritySessions = 122
MonitorSelfTime = 1530022493
MyAddress = "<128.104.119.168:9618?addrs=128.104.119.168-9618+[--1]-9618&noUDP&sock=1782_e1db_3>"
MyCurrentTime = 1530022541
MyType = "Machine"
Name = "slot1_1@cluster-0001.biochem.wisc.edu"
NextFetchWorkDelay = -1
NiceUser = false
NumPids = 1
OfflineUniverses = { }
OpSys = "LINUX"
OpSysAndVer = "CentOS7"
OpSysLegacy = "LINUX"
OpSysLongName = "CentOS Linux release 7.5.1804 (Core)"
[truncated here]
The list contains about 100 attributes “advertised” continuously in order to match jobs with nodes.
Here are output examples (shortened by the |head command limiting output to 5 lines)
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@cluster-0002 LINUX X86_64 Unclaimed Idle 0.170 42887 30+00:18:21
slot1_10@cluster-0 LINUX X86_64 Claimed Busy 1.000 512 3+23:01:35
slot1_11@cluster-0 LINUX X86_64 Claimed Busy 1.000 512 3+23:01:35
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@cluster-0002 LINUX X86_64 Unclaimed Idle 0.170 42887 30+00:18:21
slot1@cluster-0003 LINUX X86_64 Unclaimed Idle 0.000 43911 30+00:30:36
slot1@cluster-0004 LINUX X86_64 Unclaimed Idle 0.000 39303 30+00:22:31
Summary
ClassAds reflect the resources of compute nodes and the requirements of user jobs.
HTCondor matches requirements from both.
Universes¶
Info
From the HTCondor manual:
HTCondor has several runtime environments (called a universe) from which to choose. Of the universes, two are likely choices when learning to submit a job to HTCondor: the standard universe and the vanilla universe.
HTCondor supports different execution environments, called universes. A universe in HTCondor defines an execution environment. HTCondor version 8.6.11 supports several different universes for user jobs:
- standard
- vanilla
- grid
- java
- scheduler
- local
- parallel
- vm
- docker
On the BCC the default universe is vanilla; other choices would be specified in the submit description file. Universes other than vanilla require specific considerations that will not be covered in this document.
Note: While vanilla is the default, it is considered good practice to specify the universe within the submit file, as the default could be changed at a later date by the system administrators of the compute cluster, or could be different on another computer you might use.
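A minimal sketch of the corresponding submit-file line:
universe = vanilla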
Steps before running a job¶
This was discussed in a previous section (see Process above) and is reviewed here in light of information from the manual.
Code preparation. Jobs must be able to run in batch, non-interactive mode. A program that runs in the background will not be able to do interactive input and output. HTCondor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for the program (these standard streams are part of the operating system). Create any needed files that contain the proper keystrokes needed for software input. Make certain the software and scripts run correctly with the files.
HTCondor Universe. The Vanilla universe is the default. The Standard universe requires explicit recompilation of the software with HTCondor libraries and is for advanced users.
Submit description file. This plain text file contains details of the job to run, what software (executable) to run, and information about files to transfer, etc. The file can contain explicit requirements that HTCondor will match with compute nodes in terms of ClassAd.
Submit the job. The command condor_submit is used to submit the job described in the job description file.
Once submitted, HTCondor does the rest toward running the job. Monitor the job’s progress with the condor_q and condor_status commands.
You may modify the order in which HTCondor will run your jobs with condor_prio.
You can remove a job from the queue prematurely with condor_rm.
Log file. It is recommended to request a log file for the job within the submit file. The exit status (success or failure) and various statistics about the job's performance, including time used and I/O performed, will be included in the log file.
Requirements and Rank¶
Using the requirements and rank commands in the submit description file is powerful and flexible, and requires care. Default values are set by the condor_submit program if these are not defined in the submit description file.
For example, the commands sketched below, placed in a submit description file, require HTCondor to run the program on machines which have at least 32 MB of physical memory, while the rank command expresses a preference to run on machines with more than 64 MB.
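A sketch of such commands, modeled on the classic example from the HTCondor manual (the Memory attribute is expressed in MB):
requirements = Memory >= 32
rank = Memory >= 64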
The commands can use the comparison operators <, >, <=, >=, and ==, which are case insensitive, while the special comparison operators =?= and =!= compare strings case sensitively.
Please refer to the complete HTCondor manual for more details on the usage of these commands.
File transfer¶
The HTCondor manual has more details on this subject and should also be consulted.
Jobs with Shared File System¶
HTCondor is aware of the files present in the /scratch directory (see above) since the BCC has a shared file system to access input and output files.
Default requirements exist so that compute nodes can share the same data. If you place your data in e.g. /scratch/myname/somedirectory, it should be visible to any compute node running your job.
File Transfer Mechanism¶
While the BCC offers a shared file system, there are situations when it may still be appropriate to proceed as if that was not the case, for example if the job is very large and one wants to “flock” the job to a larger grid at CHTC or even the Open Science Grid. In this case a shared file system would not be available to computers outside of the local area.
Info
From the HTCondor manual:
The HTCondor file transfer mechanism permits the user to select which files are transferred and under which circumstances. HTCondor can transfer any files needed by a job from the machine where the job was submitted into a remote temporary directory on the machine where the job is to be executed.
HTCondor executes the job and transfers output back to the submitting machine. The user specifies which files and directories to transfer, and at what point the output files should be copied back to the submitting machine. This specification is done within the job’s submit description file.
To enable the file transfer mechanism, place two commands in the job’s submit description file: should_transfer_files and when_to_transfer_output.
By default, they will be:
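should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
(these defaults follow the HTCondor manual; verify against the version installed on the cluster)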
Setting the should_transfer_files command explicitly enables or disables the file transfer mechanism. The command takes one of three possible values:
- YES
- IF_NEEDED
- NO
Specifying What Files to Transfer: If the file transfer mechanism is enabled, HTCondor will transfer the following files before the job is run on a remote machine:
- the executable, as defined with the executable command
- the input, as defined with the input command
If the job requires other input files, the submit description file should utilize the transfer_input_files command as a comma-separated list.
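For example (a sketch with hypothetical file names):
transfer_input_files = data1.txt, data2.txt, parameters.cfg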
File Paths for file transfer¶
Info
From the HTCondor manual:
The file transfer mechanism specifies file names and/or paths on both the file system of the submit machine and on the file system of the execute machine. Care must be taken to know which machine, submit or execute, is utilizing the file name and/or path.
See the manual for more details. Files can also be transferred by URL (http), e.g. using the wget command.
Managing jobs¶
Once a job has been submitted, HTCondor will attempt to find resources to run it (matching the ClassAds from the job requirements with those advertised by the compute nodes).
Specific commands can be used to monitor jobs that have already been submitted.
A list of submitted jobs, and who submitted them, can be obtained with condor_q, which is also used to assess job progress.
The job ID reported by condor_q can be used to remove a job that is no longer needed; for example, if the ID is 77.0 the command would be condor_rm 77.0.
The job would be terminated and all files discarded.
Jobs can be placed on hold and then released again by giving the job ID to the corresponding hold and release commands, and a list of jobs in the hold state, together with the reason for their holding, can also be obtained.
If a job is not running, wait about 5 minutes so that the ClassAds have been negotiated, and then analyze the job (see more in the manual). These tasks are gathered in the sketch at the end of this section.
Finally, some jobs may be able to have their priority altered by the condor_prio command.
Job completion¶
Info
From the HTCondor manual:
When an HTCondor job completes, either through normal means or by abnormal termination by signal, HTCondor will remove it from the job queue. That is, the job will no longer appear in the output of condor_q, and the job will be inserted into the job history file.
Examine the job history file with the condor_history command. If there is a log file specified in the submit description file for the job, then the job exit status will be recorded there as well.
Statistics about the job will be included in the log file if it was requested within the submit file, which is strongly suggested.
Tag file names with job or process¶
There are methods to tag the names of files with the job name, process number, the compute node, or the cluster name, using a command within the submit file.
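For example (a sketch; the base name myjob is a placeholder, while $(Cluster) and $(Process) are standard submit-file macros):
output = myjob_$(Cluster)_$(Process).out
error = myjob_$(Cluster)_$(Process).err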
Commands of the style $(name) are the way HTCondor handles variables that have been properly set up.
Example: environment variables¶
Here is a small example putting things together with a submit and executable file.
environ.sub submit file:
executable = run_environ.sh
output = run_environ_$(Cluster).out
error = run_environ_$(Cluster).err
log = run_environ_$(Cluster).log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_cpus =1
queue 1
run_environ.sh executable file:
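A sketch of what run_environ.sh might have contained, reconstructed from the output described below (an assumption, not the original script):
#!/bin/bash
# send a short message to standard output
echo "this is running!"
# record the working directory, its contents, and the environment into test.out
pwd > test.out
ls >> test.out
printenv >> test.out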
submit the job:
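$ condor_submit environ.sub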
Submitting job(s).
1 job(s) submitted to cluster 3775.
We can see the status of the job with condor_q:
-- Schedd: submit.biochem.wisc.edu : <128.104.119.165:9618?... @ 06/26/18 09:39:33
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
jsgro CMD: run_environ.sh 6/26 09:38 _ _ 1 1 3775.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
The job runs quickly and the result files are transferred back, as shown with this list:
environ.sub run_environ_3775.out
run_environ_3775.err run_environ.sh
run_environ_3775.log test.out
The executable file (run_environ.sh) created standard output that is captured within the file run_environ_3775.out and will contain a single line with “this is running!”
The file test.out created by the executable will contain the directory name on the compute node, a listing of its contents, and then the values of environment variables:
/var/lib/condor/execute/dir_6663
_condor_stderr
_condor_stdout
condor_exec.exe
test.out
_CONDOR_JOB_PIDS=
_CONDOR_ANCESTOR_1864=4326:1527864152:618879280
TMPDIR=/var/lib/condor/execute/dir_6663
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_6663
_CHIRP_DELAYED_UPDATE_PREFIX=Chirp
TEMP=/var/lib/condor/execute/dir_6663
BATCH_SYSTEM=HTCondor
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_6663/.chirp.config
_CONDOR_ANCESTOR_4326=6663:1530024011:1517681191
PWD=/var/lib/condor/execute/dir_6663
CUDA_VISIBLE_DEVICES=10000
_CONDOR_AssignedGPUs=10000
_CONDOR_ANCESTOR_6663=6668:1530024013:4039808267
_CONDOR_SLOT=slot1_2
SHLVL=1
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_6663/.machine.ad
TMP=/var/lib/condor/execute/dir_6663
GPU_DEVICE_ORDINAL=10000
OMP_NUM_THREADS=1
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_6663/.job.ad
_CONDOR_JOB_IWD=/var/lib/condor/execute/dir_6663
_=/usr/bin/printenv
This example illustrates that the directory where calculations happen is different every time. In this example it is /var/lib/condor/execute/dir_6663.
Transfer of environment: getenv¶
At BCC we had a situation where a script would fail as a job but ran perfectly fine as an interactive test (at the shell prompt) or as an interactive HTCondor job. The solution was to add one line, getenv = true, to the submit file, as explained in the tip below.
The following explanation was given by CHTC personnel:
Tip
This can happen because the environment is subtly different between the two -- the interactive job inherits all of your usual shell environment, but the batch job doesn't.
To address this, you can use this option in the submit file:
getenv = true
This will emulate the interactive environment in the batch job and hopefully solve your problem there.
In the next section we’ll learn about specifying local library dependencies.
Library dependencies¶
Software exists in the form of binary files, but often also relies on external “libraries” that are required. In some cases, the software can be compiled specifically to incorporate the libraries within its own binary file in order to eliminate external dependencies. However, this is not always possible, and a method to deal with external libraries exists within HTCondor: the LD_LIBRARY_PATH environment variable.
For example:
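(a minimal sketch; the path assumes the needed library has been copied to a lib directory under your own /scratch area)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/scratch/myname/lib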
(replace myname with your directory name; in bash use the export command.)
Question: Why/When would I need this?
On the BCC Linux cluster, you will need to use this method if any of your software is written in the FORTRAN language as the FORTRAN libraries are not available on a default, modern Linux installation.
Note: this is true even if you compile with the "-static" or equivalent option(s), as will be shown in the following simple example.
FORTRAN 77: Hello World example¶
Warning
UPDATE: SKIP THIS PARAGRAPH
f77 and g77 are no longer installed on the cluster.
This section is kept here for general information about the use of the LD_LIBRARY_PATH option.
All FORTRAN programs to be run on the BCC Linux cluster will require this step.
FORTRAN programs can be compiled with the compilers f77 or g77, each with specific modifiers.
Let’s create a minimal “Hello World!” FORTRAN 772 program that prints to standard output, save it in a file called hello77.f for example:
Then we can compile the program with the mode that requests that external libraries be included in the final binary. This “switch” is different for the two compilers:
$ g77 -fno-automatic hello77.f -o hello77_static_g77
$ f77 -Bstatic hello77.f -o hello77_static_f77
In practice, the resulting binary has the same size as if the “static” option was not invoked (this can be verified with the Unix command ls -l to list files).
Running the file interactively within the shell will always work:
Hello, world!
However, running this file with HTCondor on the BCC Linux cluster will cause an error to be reported and the program will FAIL:
./hello77_static_f77: error while loading shared libraries: libg2c.so.0: cannot open shared object file: No such file or directory
This is true for ALL FORTRAN programs. Therefore, there will be a requirement for a SYSTEM library, defined within the *.sh script file. This will tell the executable program where the necessary library is located.
The following example takes advantage of the fact that /scratch is a shared volume. Therefore, after copying the "libg2c.so.0" library to /scratch/myname/lib, a minimal script file could be written as:
#!/bin/bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/scratch/myname/lib
# run the program:
./hello77_static_f77
IF there is no shared volume on the cluster, the necessary library or libraries could be placed in a “lib” directory to be transferred with the other files at submission and a relative path could be created. For example:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./lib
FORTRAN 95: gfortran¶
gfortran (also called f95 on the BCC Linux cluster) is GNU FORTRAN and is the software to use.
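Its version can be checked with (a sketch):
$ gfortran --version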
GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING
The general compilation format is:
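A sketch (myfile.f and program.exe are placeholder names):
$ gfortran myfile.f -o program.exe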
Static linking¶
From the GNU web site3:
Info
gfortran is composed of two main parts: the compiler, which creates the executable program from your code, and the library, which is used when you run your program afterwards. That explains why, if gfortran is installed in a non-standard directory, it may compile your code fine but the executable may fail with an error message like library not found. One way to avoid this (more ideas can be found on the binaries page) is to use the so-called "static linking", available with the option -static. gfortran then puts the library code inside the program created, thus enabling it to run without the library present (for example, on a computer where gfortran is not installed). A complete example is:
gfortran -static myfile.f -o program.exe
However, just doing this causes an error on the cluster:
/usr/bin/ld: cannot find -lgfortran
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lquadmath
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
We can find the location of the library with4:
$ gfortran -print-file-name=libgfortran.so
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/libgfortran.so
Then export LDFLAGS with the above information, without the file name:
$ export LDFLAGS=-L/usr/lib/gcc/x86_64-redhat-linux/4.8.5
The specific requirements for libraries do not work in the same way for gfortran, and binaries will have a different size if compiled with the necessary switch:
$ gfortran hello77.f -o hello77_gfortran
$ gfortran hello77.f -static-libgfortran -o hello77_static_gfortran
We can see that the sizes are different:
$ ls -l *gfortran | cut -c34-110
7970 Jun 2 13:17 hello77_gfortran
158309 Jun 2 13:17 hello77_static_gfortran
A test on the BCC Linux cluster has shown that the gfortran-compiled binaries in this simple form do not require any library specifications, unlike the files compiled with f77 or g77.
Therefore, which compiler is used makes a difference. However, some program code might be sensitive to the nature of the compiler so this might not be a universal result.
Compiling¶
Compiling other programs in various languages will require some knowledge on compilation methods and verifying that the appropriate compiler or compilers are installed.
The command "which
" can be useful to locate where program are installed if they are present.
Compilers installed on BCC Linux include: gcc, gcc34, cc, c89, c99, c++, g++, g++34, g77, f77, f95, gfortran.
C pre-processors installed: cpp and mcpp.
Standard Universe¶
The Standard Universe offers advantages over the Vanilla Universe, especially for long-running computations that may create incremental output files.
However, the software has to be recompiled with the condor_compile command.
Access to the original source or object code is required for this step; if it is not available, the Standard Universe cannot be used.
Read more in the HTCondor manual.5
Compiling R¶
Since there are no common directories for software, each user has to compile their own. Once compiled, the software should be moved into the /scratch directory or a subdirectory within it.
Download R¶
The repository for R is “The Comprehensive R Archive Network” or CRAN https://cran.r-project.org/
There are multiple options for Linux, but for redhat there is only a README file. For us, we need to download the source code, which we can get from:
https://cran.r-project.org/src/base/R-3/
For example, download R-3.2.5.tar.gz
Install¶
Most open-source software has a README and an INSTALL file in plain text with information.
Note: the CHTC help pages suggest using an interactive HTCondor session. However, low-impact compilation could reasonably occur on the “submit” node.
Notes:
- There is no X11 on the cluster, therefore care has to be taken not to use X11 during compilation; this is accomplished with --with-x=no
- To avoid library sharing, --enable-R-shlib was removed from the default command.
Uncompress:
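For example (a sketch, assuming the R-3.2.5.tar.gz archive downloaded above):
$ tar -xzf R-3.2.5.tar.gz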
Then change into the new directory (cd), configure, and compile (make):
Here is an example to compile R version 3.2.5:
cd R-3.2.5
# Compile R 3.2.5 on 12/12/2016 in /scratch/jsgro/R/
./configure --prefix=$(pwd) --with-readline=no --with-x=no
make
The R and Rscript executables are located in the ./bin directory.
To start R interactively from the R-3.2.5 directory type:
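$ ./bin/R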
Problem with Zlib in newer version¶
Starting with version R-3.3.0 there is a newer requirement for zlib, which does not seem to be installed on BCC. Therefore, it may be necessary to compile zlib first for these versions.
See for example:
http://pj.freefaculty.org/blog/?p=315
https://stat.ethz.ch/pipermail/r-devel/2015-April/070951.html
The first link provides an example to compile zlib.
The second link provides an example to configure differently.
R additional packages¶
To use R on the cluster, additional packages should be added first, e.g. from CRAN or Bioconductor, and the compiled R directory then compressed and transferred, e.g. with tar.
The added packages will be installed in the directory echoed by the R command .Library.
In the case of this example, start R (e.g. ./bin/R) and install the additional packages from within the R console.
The compressed tar file should then be part of the files transferred, and unarchived by a command contained within the executable (*.sh) file for the job.
You will need to sort out how to transfer the compressed file(s) onto the compute node, how to uncompress and unarchive them, and how to access them; a sketch is shown below.
See for example the section on LD_LIBRARY_PATH above to implement this.
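A minimal sketch (file and directory names are hypothetical): in the submit file, list the archive as an input file,
transfer_input_files = R-3.2.5.tar.gz
and at the top of the executable *.sh script, unpack it before calling R:
#!/bin/bash
# unpack the R installation shipped with the job
tar -xzf R-3.2.5.tar.gz
# run an R script with the unpacked installation
./R-3.2.5/bin/Rscript myscript.R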
Interactive Jobs¶
Interactive jobs are now possible on BCC. The default Universe will be Vanilla.
From the manual6:
Info
An interactive job is a Condor job that is provisioned and scheduled like any other vanilla universe Condor job onto an execute machine within the pool. The result of a running interactive job is a shell prompt issued on the execute machine where the job runs. The user that submitted the interactive job may then use the shell as desired, perhaps to interactively run an instance of what is to become a Condor job.
This option may be beneficial for example to compile software (see previous section) without impeding the submit node.
Interactive shell¶
The simplest command to start an interactive job is:
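A sketch (condor_submit can also be given a submit description file to customize the interactive job):
$ condor_submit -interactive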
There may be a delay for the interactive shell to begin depending on the cluster availability.
The interactive shell will be running as an HTCondor job on a compute node rather than the submit node.
Once activated, a splash screen of information will be presented. An error about the “home directory” not being found is normal, as HTCondor nodes are only aware of the /scratch directory, as explained previously.
Note that the shell will log itself out after 7200 seconds (2hrs) of inactivity.
We now have an interactive bash shell.
/var/lib/condor/execute/dir_36940
_CONDOR_JOB_PIDS=
HOSTNAME=cluster-0005.biochem.wisc.edu
TERM=xterm-256color
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=128.104.119.165 22309 9618
TMPDIR=/var/lib/condor/execute/dir_36940
_CONDOR_ANCESTOR_28163=36940:1525207142:4203470193
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_36940
SSH_TTY=/dev/pts/0
USER=jsgro
_CHIRP_DELAYED_UPDATE_PREFIX=Chirp
LD_LIBRARY_PATH=/usr/local/cuda/lib64:
TEMP=/var/lib/condor/execute/dir_36940
BATCH_SYSTEM=HTCondor
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_36940/.chirp.config
TMOUT=7200
[truncated]
It is important to note that the environment variables are transferred from the user, including the $HOME directory, even if HTCondor cannot access it!
Most importantly, the /scratch directory is available.
From this point it will be possible to work to prepare software and files without impacting the submit node.
When done, the shell can be released with the exit command:
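$ exit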
logout
Connection to condor-job.cluster-0005.biochem.wisc.edu closed.
File transfer from research.drive.wisc.edu¶
Moving files on and off the cluster:¶
Note: older "fs.biochem.wisc.edu", fs2, and similar servers are now defunct.
The current file server is smb://research.drive.wisc.edu, which can be mounted on a Mac or PC as a network drive. Therefore, files stored there will be available within the cluster.
HTCondor does not see these directories; therefore, files of interest have to be either moved onto /scratch or transferred via the .sub command file.
Useful commands¶
The cp or mv commands can be used to copy or move files.
Adding -r can help to transfer complete directories and their content.
Alternatively, archiving commands can also prove useful, such as:
- zip and unzip
- gzip and gunzip
- tar
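For example (a sketch; the directory and archive names are hypothetical, and the lab mount follows the /mnt/rdrive/labname pattern described above):
# copy a whole results directory to a lab mount
cp -r results/ /mnt/rdrive/labname/
# or bundle it into a single compressed archive first
tar -czf results.tar.gz results/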
1. http://research.cs.wisc.edu/htcondor/description.html
2. Examples for many languages can be found at http://en.wikipedia.org/wiki/List_of_Hello_world_program_examples [ARCHIVED 11APR2015; SHORT URL: https://bit.ly/2N3dit8+]
3. https://gcc.gnu.org/wiki/GFortranGettingStarted
4. https://github.com/JuliaLang/julia/issues/6150
5. http://research.cs.wisc.edu/htcondor/manual/v8.6/2_4Running_Job.html
6. http://research.cs.wisc.edu/htcondor/manual/v8.6/2_5Submitting_Job.html (see section 2.5.13)