In class these exercises will be run on the classroom iMacs.
I’ll provide Windows hints and instructions where possible, but a basic understanding of the Windows command line would be more than useful for that (e.g. knowing what DOS is).
Be familiar with Docker or follow workshop 1 “Docker - Beginner Biologist 1”.
Docker will be used from a command-line terminal: Terminal on a Macintosh in the classroom. A rudimentary knowledge of the bash command line is necessary.
If you are a Windows user: PowerShell can be used as a terminal. However, setting up Docker to run on Windows is more involved (not covered in class).
Docker username: downloads will require a (free) username, therefore registration is necessary in order to follow the tutorial. Go to https://hub.docker.com and use the button “Sign up for Docker Hub” to register.
Tutorials will be held in the Biochemistry classroom 201, and Docker has already been installed.
Instructions for installation can be found via the install link on the Docker web site.
Note HTML Version only:
If you are following this document in HTML format the code is shown with a colored background:
White background: standard output of programs.
To get started we need to open a text terminal as detailed below. In class we’ll use a Macintosh.
Do one of the following:
If you are on a Macintosh:
Find the Terminal icon in the /Applications/Utilities directory. Then double-click on the icon and Terminal will open.
Alternatively, search for Terminal (e.g. with Spotlight) and press return. Terminal will open.
If you are on a PC:
Find PowerShell, e.g. using Windows search or Cortana. This will open a suitable text-based terminal. (Note: Windows cmd does not offer the appropriate commands.)
At the $ or > prompt within the window of Terminal, cmd or PowerShell, type docker --version to check the version currently installed. This confirms that Docker is properly installed. The exact running version itself is not very important.
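As a minimal sketch of this check in a bash terminal, the lines below first test that the docker command can be found and only then ask for its version. The variable name docker_status and the messages are made up for this illustration.

```shell
# Check that the docker command exists before asking for its version.
if command -v docker >/dev/null 2>&1; then
    docker_status="installed"
    docker --version
else
    docker_status="missing"
    echo "docker was not found; check the installation."
fi
echo "Docker is ${docker_status}."
```

If the last line reports that Docker is missing, the installation instructions mentioned earlier are the place to start.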
Before going further, you now need to log in with your Docker Hub ID. You should already have created one before this or the previous workshop. If you need to create an ID now, go to https://hub.docker.com to register.
Docker login: type docker login at the prompt and answer the questions:
Login with your Docker ID to push and pull images from Docker Hub.
If you don't have a Docker ID, head over to https://hub.docker.com
to create one.
Username: YOUR_DOCKER_ID_HERE
Password:
Login Succeeded
$
Note: if you do not log in first you will receive an error message when trying to start docker in the next steps.
In due time you will be able to create your own docker image. But for now we’ll use images that are available on the Docker hub.
A docker image can contain a single useful program, or it can give access to a series of programs. The more software the image contains, the more disk space it is likely to require. For example, the ORCA image (Jackman et al. (2019)) is close to 30 GB in size but contains over 600 bioinformatics programs and utilities.
For this series of exercises we’ll look for and use a docker image of the EMBOSS (Rice, Longden, and Bleasby (2000)) series of sequence analysis software.
“The European Molecular Biology Open Software Suite” (EMBOSS) is a free Open Source software analysis package specially developed for the needs of the molecular biology user community.
EMBOSS contains a large number of sequence analysis tools, and we’ll sample a few of them through Docker.
The purpose of this tutorial is more to learn how to use a Docker container than to learn EMBOSS itself. However, here are a few links for learning more about EMBOSS for reference:
EMBOSS | Link |
---|---|
Home page | http://emboss.sourceforge.net |
Tutorial | http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html |
Applications | http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/index.html |
Grouped by functions | http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/groups.html |
Any person with a Docker ID can create and upload images that are accessible to other users. Therefore we’ll find a large number of Docker images available. However, they will not be constructed in the same way, may not contain the same version of the software, and might not have been updated in a long time. Therefore finding a suitable image might require some browsing before deciding which one(s) to download and test.
Open a web browser, go to https://hub.docker.com and search for emboss.
The resulting web page will show results. As of today (Oct 2019) there are 23 results.
How many results did you get? ______________________________________________
The results are shown sorted by “Most Popular”, which may be the best option. The other option is “Recently Updated”, which may be a better choice in certain cases.
For today we’ll choose the first one, named “biocontainers/emboss”.
Important note: the full name of the docker image is biocontainers/emboss containing 2 words. This complete name will need to be used later to activate it.
Click on the biocontainers/emboss box.
The user “biocontainers” is a provider of a large number of other docker images and has well organized pages. Once you get on that page you’ll see that there are different tabs, including Overview and Tags.
In the next step we’ll want to pull (download from the hub) the docker image. On the default (Overview) tab you can see (on the right) the command that you can copy to pull the image onto your computer. However, if you were to do this now you’d get an error:
Using default tag: latest
Error response from daemon: manifest for biocontainers/emboss:latest not found: manifest unknown: manifest unknown
The error is apparently due to the fact that docker cannot find biocontainers/emboss:latest.
Tags:
This means that we need to talk about tags. The default tag is latest: it does not need to be typed, because docker assumes it whenever no tag is given. This works most of the time, unless the author(s) of the image decided to publish it only under specific tags. In that case latest does not exist and the specific tag has to be stated explicitly in the pull request.
For example, in the previous workshop we pulled the image for the small Linux distribution called alpine. The command was simply docker pull alpine. Then, when we asked to show the list of images with the command docker image ls alpine, we could note that latest was entered under the column TAG:
REPOSITORY TAG IMAGE ID CREATED SIZE
alpine latest 11cd0b38bc3c 14 months ago 4.41MB
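The default-tag rule can be mimicked in a few lines of bash. This is only an illustration of the naming convention: the function name full_ref is made up for this example, and it deliberately ignores registry addresses that contain a port number.

```shell
# Return the complete image reference, adding ":latest" when no tag is given.
# (Hypothetical helper for illustration; docker applies this rule internally.)
full_ref() {
    case "$1" in
        *:*) printf '%s\n' "$1" ;;          # a tag is already present
        *)   printf '%s:latest\n' "$1" ;;   # no tag: assume the default
    esac
}

full_ref alpine
full_ref biocontainers/emboss
full_ref biocontainers/emboss:v6.6.0dfsg-7b1-deb_cv1
```

The first two calls print alpine:latest and biocontainers/emboss:latest, while the third prints its reference unchanged. The second result is exactly the name that docker looked for and could not find in the error message seen earlier.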
However, for the biocontainers images, it is necessary to use a specific tag, which is listed under the Tags tab of the web page for the container.
Click on the Tags tab of the biocontainers/emboss page.
You will note that by default the tags are shown sorted as “latest” (right hand side pull-down menu). As of today this looks like this:
As of this writing the latest release of EMBOSS is 6.6.0, and the tag seems to reflect this in its first few characters: v6.6.0.
Consequence: The pull command must contain the complete specific tag.
Pull biocontainers/emboss image.
From the above information it follows that the default pull command shown on the Overview tab will not work as is, and a specific tag needs to be added to the request.
To that effect use the mouse to copy the tag and add it to the pull request after a colon, e.g. docker pull biocontainers/emboss:v6.6.0dfsg-7b1-deb_cv1
Note: The latest tag might change in the future and may be different than the one used below.
The tag will also need to be used later to activate the image (into a usable container.)
In a previous workshop we learned how to list the docker images that are currently installed on the system. We can list this one specifically with the command docker image ls biocontainers/emboss:
REPOSITORY TAG IMAGE ID CREATED SIZE
biocontainers/emboss v6.6.0dfsg-7b1-deb_cv1 bc147a9dd825 5 weeks ago 638MB
Note the TAG column.
Now that we have what seems to be an appropriate image with the latest version of EMBOSS, we can activate the image and “dive into it!”
Reminder: To create a container from an image we use the command docker run, which can also be altered by a number of modifiers. In the following command we’ll add these modifiers, as we learned in the previous workshop:
-t: “Allocate a pseudo-TTY” (i.e. a text terminal)
-i: interactive
--rm: “Automatically remove the container when it exits”
The complete list of modifiers is shown by docker run --help.
Finally, remember that the image tag is mandatory, otherwise you’ll get an error that says:
Unable to find image 'biocontainers/emboss:latest' locally
docker: Error response from daemon: manifest for biocontainers/emboss:latest not found: manifest unknown: manifest unknown.
See 'docker run --help'.
We’ll now explore the inside of the container…
Run the following command and those that follow:
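A minimal sketch of such a command, combining the three modifiers with the tagged image name. It assumes that the image opens a shell by default, and the test for an attached terminal is an addition for this example, since -i and -t need one.

```shell
image="biocontainers/emboss:v6.6.0dfsg-7b1-deb_cv1"

# -i and -t need a real terminal, so only start the container when one is attached.
if command -v docker >/dev/null 2>&1 && [ -t 0 ]; then
    docker run -it --rm "$image"
else
    echo "To start the container, type: docker run -it --rm $image"
fi
```

Once the container has started, the prompt changes to one of the following form: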
biodocker@bb321bea813b:/data$
We are now looking within the container, as indicated by the long prompt (with blue background in the HTML version of this document). We can now explore the inside of the container (directories and software). Later we’ll restart the container with a shared directory to actually accomplish some analysis. For now we’ll explore the structure of the container.
We can now check where we are within the container system:
We can also check if there are any files contained within this directory:
We conclude that the directory is empty. We can remember that fact as useful information to later share this directory when we restart.
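The two checks above can be sketched with standard shell commands. The choice of pwd and ls -la is an assumption about which commands to use, and the variable name here is only for illustration.

```shell
# Print the current (working) directory, then list its content,
# including hidden files (-a) in long format (-l).
here="$(pwd)"
echo "$here"
ls -la "$here"
```

Inside the container the first line should be /data, as already hinted by the prompt. An empty directory lists only the two special entries . and .. (the directory itself and its parent).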
The underlying structure of the container is a general Linux system.
Images (and therefore containers) can be created from various versions of Linux, such as the popular Ubuntu.
The word “Linux” is very generic and it may be useful to find out which version is actually running within the container. In fact “Linux” should be called “GNU/Linux”.
The next 3 commands request information about the Linux version running under the hood. They are here for information but are not critical to using the container or the EMBOSS software.
The command uname -a shows all the information it knows about the system:
This tells us that we are running an Intel-compatible (x86) 64-bit system based on linuxkit, “a toolkit for building custom minimal, immutable Linux distributions.”
The next 2 commands, cat /etc/issue and cat /etc/os-release, show that this Linux belongs to the Debian family, one of the 2 major families (the other is the “Red Hat” family).
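The three commands can be grouped as in the sketch below. The test for each file is an addition for this example, so that the same lines also run on systems such as macOS where these files do not exist.

```shell
# Kernel and hardware information.
uname -a

# Distribution name and version. These files exist on most Linux systems
# but not on macOS, so test for them first.
for f in /etc/issue /etc/os-release; do
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "$f not found on this system"
    fi
done
```

Running the same lines on the host and inside the container is a simple way to see that the two are different systems.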
One question that arises on occasion is “who is the user” of the container. Some containers are by default running as “root”, i.e. administrator level. We have a hint here that we are not running as “root” because the prompt ends with a $ sign, while the “root” user would show a hash/pound sign #.
We can ask what is the username with the command:
We can also ask what is the default shell running:
Therefore we are logged in as user biodocker running the bash shell, and that is all good.
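A short sketch of these checks, assuming the standard commands whoami and id and the shell variable SHELL. The root test at the end is an addition that confirms the hint given by the prompt.

```shell
# Name of the current user.
current_user="$(whoami)"
echo "User: $current_user"

# Default shell of the current user, if the variable is set.
echo "Shell: ${SHELL:-not set}"

# The root user always has the numeric user ID 0.
if [ "$(id -u)" -eq 0 ]; then
    echo "running as root"
else
    echo "not running as root"
fi
```

Inside the container these lines are expected to report the user biodocker and a bash shell, in agreement with the conclusion above.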
EMBOSS consists of a series of programs for the analysis of protein or nucleic acid (DNA and RNA) sequences (but not Next Gen sequencing).
We can figure out where the software is located with a few commands, just knowing the name of at least one of the programs. For example, the program needle is an implementation of the Needleman-Wunsch global alignment of two sequences (Needleman and Wunsch (1970)).
The bash command which shows the location of a given program:
Then we can list the location with a long list:
This tells us that the program is located in another directory (-> .. indicates a “symbolic link,” sometimes known as a “shortcut”) and we can list the complete directory, with the understanding that .. represents the parent directory. Therefore, we can first list the emboss directory found in /usr/lib, as suggested by the symbolic link:
We can now list the entire emboss directory with:
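The search for the program can be sketched as follows. The variable names are made up for this example, and the lines only give useful results inside the container, where needle is installed. The directory /usr/lib/emboss is the one named in the text above.

```shell
prog="needle"

# Location of the program, if it is on the PATH (empty otherwise).
prog_path="$(which "$prog" 2>/dev/null || true)"

if [ -n "$prog_path" ]; then
    echo "$prog_path"
    # Long listing: a symbolic link is shown as "name -> target".
    ls -l "$prog_path"
    # List the directory that holds the real program files.
    ls /usr/lib/emboss
else
    echo "$prog was not found; run these lines inside the container."
fi
```
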
Here are the first few lines for the 261 programs that are included in this version:
aaindexextract drfindformat megamerger seqret
abiview drfindid merger seqretsetall
acdc drfindresource msbar seqretsplit
acdgalaxy drget mwcontam seqxref
acdlog drtext mwfilter seqxrefget
...
...
Within the list you should be able to spot the needle program.
Terminate the container.
This will echo the word exit and return us to the host computer prompt.
$
We have just used the EMBOSS programs from within the container while sharing a directory with the host computer. In this way it is easy to call each EMBOSS program simply by name directly: needle, pepwheel, etc.
However, as we alluded to in the previous workshop, we can also call each of the programs from the host computer itself, provided we give the proper container information. This may be useful in cases where only one or two programs need to be accessed while working on a project.
An analogy for using a container from the outside could be using a remote control to control actions within a black-box…
As an example we’ll create a new alignment from sequences located in the dockershare directory without entering the docker container itself. (Hence, in the HTML version of this document the background will be green, on the host, rather than blue, within the container.)
However, it is still necessary to specify the shared directory on the command line, as this is the only way that the programs within the container (here needle) can “see” the files that are on the host computer.
The sequences are located within the dockershare directory, which is mapped to /data within the container. As such we actually do not need to be within dockershare on the host computer terminal for these commands to work. However, it is still useful to place the focus on the shared directory to quickly explore any output result:
Then we run the EMBOSS needle program by activating the container. When the needle program is finished, the container will exit and be removed (--rm).
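The two steps can be sketched as below. The variable names share and image are only for illustration, and the tests for the directory, for docker and for an attached terminal are additions so that the lines fail gently when something is missing.

```shell
share="$HOME/dockershare"                            # shared directory on the host
image="biocontainers/emboss:v6.6.0dfsg-7b1-deb_cv1"  # image name with its tag

# Move to the shared directory, if it exists, to see the output files easily.
if [ -d "$share" ]; then
    cd "$share"
fi

# Start needle in a temporary container; /data inside maps to $share outside.
if command -v docker >/dev/null 2>&1 && [ -t 0 ]; then
    docker run -it --rm -v "$share":/data "$image" needle
else
    echo "Type: docker run -it --rm -v $share:/data $image needle"
fi
```

Because no sequence names are given on the command line, needle asks for them one by one.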
The names of the sequence files have to be typed, while the default options on the following lines can simply be accepted by pressing return.
Input sequence: GLP-1.fa
Second sequence(s): GLP-2.fa
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output alignment [glp-1.needle]:
Alternatively the command could also be written as a single line:
docker run -it --rm -v $HOME/dockershare:/data biocontainers/emboss:v6.6.0dfsg-7b1-deb_cv1 needle GLP-1.fa GLP-2.fa -outfile glp-1.needle -gapopen 10 -gapextend 0.5
Needleman-Wunsch global alignment of two sequences
Note that for clarity we can use the “line continuation symbol” \ to break the long line into more readable code:
docker run -it --rm -v $HOME/dockershare:/data \
biocontainers/emboss:v6.6.0dfsg-7b1-deb_cv1 needle \
GLP-1.fa GLP-2.fa -outfile glp-1.needle -gapopen 10 -gapextend 0.5
We can check that the alignment worked by looking at the last few lines of the output file with the tail command:
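A minimal sketch, assuming the output file was written to the shared directory under the name used above. The test for the file is an addition for this example.

```shell
outfile="$HOME/dockershare/glp-1.needle"

# Show the last 10 lines (the default for tail) of the alignment file.
if [ -f "$outfile" ]; then
    tail "$outfile"
else
    echo "$outfile not found; run needle first."
fi
```

To see a different number of lines, tail accepts the option -n followed by a number, e.g. tail -n 20.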
Before we finish we want to make sure that no container is in halted mode (e.g. when --rm is forgotten), as these can accumulate over time and use a large amount of space.
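The check can be sketched with the command docker container ls -a, which lists all containers whether running or halted. The test that Docker is reachable and the variable name checked are additions for this example.

```shell
# List all containers, running or halted (-a), if Docker is reachable.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    docker container ls -a
    checked="yes"
else
    echo "Docker is not running; nothing to list."
    checked="no"
fi
```

A halted container shows a status beginning with Exited in the STATUS column, and its identifier is found in the first column, CONTAINER ID.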
If you have any halted containers, delete them using their CONTAINER ID, e.g. by changing the value below to that of your container(s).
docker rm 461346b98938
Docker Commands | Comment |
---|---|
docker --version | Short output of version |
docker login | Required. Register at https://hub.docker.com |
docker pull | download a docker image from hub.docker.com |
tag | some docker images require a specific tag |
docker image ls | list docker images. Equiv command: docker images |
docker run -it --rm -v $HOME/dockershare:/data | run shell in container, share dockershare directory |
docker container ls -a | list all containers, including halted ones. Equiv command: docker ps -a |
Shell Commands | Comment |
---|---|
cat | print entire file on screen |
cat > filename | send input to filename |
more | print text file content on screen. Space bar for 1 more screen, q to quit. |
uname -a, cat /etc/issue, cat /etc/os-release | shell commands to view Linux version |
which | shell command to find program location |
$HOME | shell variable designating the default home folder |
cd $HOME/dockershare | change directory to dockershare located in $HOME |
EMBOSS Commands | Comment |
---|---|
needle | global pairwise alignment of 2 sequences |
pepwheel | draws a helical wheel diagram for a protein sequence |
pepinfo | plots various amino acid properties in parallel |
Brubaker, P. L., and D. J. Drucker. 2002. “Structure-function of the glucagon receptor family of G protein-coupled receptors: the glucagon, GIP, GLP-1, and GLP-2 receptors.” Recept. Channels 8 (3-4): 179–88. https://doi.org/10.3109/10606820213687.
Jackman, S. D., T. Mozgacheva, S. Chen, B. O’Huiginn, L. Bailey, I. Birol, and S. J. M. Jones. 2019. “ORCA: A Comprehensive Bioinformatics Container Environment for Education and Research.” Bioinformatics, April. https://doi.org/10.1093/bioinformatics/btz278.
Needleman, S. B., and C. D. Wunsch. 1970. “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” J. Mol. Biol. 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.
Park, Min Kyun. 2015. “Subchapter 17A - Glucagon.” In Handbook of Hormones: Comparative Endocrinology for Basic and Clinical Research, 129–31. Academic Press. https://doi.org/10.1016/B978-0-12-801028-0.00138-0.
Rice, P., I. Longden, and A. Bleasby. 2000. “EMBOSS: the European Molecular Biology Open Software Suite.” Trends Genet. 16 (6): 276–77. https://doi.org/10.1016/S0168-9525(00)02024-2.