The main learning objective is to experience different methods of using Docker images and to face (and resolve) challenges while doing so.
In class these exercises will be run on the classroom iMacs.
I’ll provide Windows hints and instructions when possible, but a basic understanding of the Windows command line is more than useful for that (e.g. knowing what DOS is; see APPENDIX C.)
Be familiar with Docker or follow workshop 1 “Docker - Beginner Biologist 1” and workshop 2 “Docker - Beginner Biologist 2.”
Docker will be used from a command-line terminal: Terminal on a Macintosh in the classroom. A rudimentary knowledge of the bash command line is necessary.
If you are a Windows user: PowerShell can be used as a terminal. However, setting up Docker to run on Windows is more involved (not covered in class.)
Docker username: downloads will require a (free) username, therefore registration is necessary in order to follow the tutorial. Go to https://hub.docker.com and use the button “Sign up for Docker Hub” to register.
Tutorials will be held in the Biochemistry classroom 201, where Docker has already been installed.
Instructions for installation can be found via the install link [1] on the Docker web site.
Note HTML Version only:
If you are following this document in HTML format the code is shown with a colored background:
White background: standard output of programs.
To get started we need to open a text terminal as detailed below. In class we’ll use a Macintosh.
Do one of the following:
If you are on a Macintosh:
Find the Terminal icon in the /Applications/Utilities directory. Then double-click on the icon and Terminal will open.
Alternatively, search for Terminal (e.g. with Spotlight) and press return. Terminal will open.
If you are on a PC:
Find PowerShell, e.g. using Windows search or Cortana. This will open a suitable text-based terminal.
(Note: Windows cmd does not offer the appropriate commands.)
This ensures that Docker is properly installed. The exact running version itself is not very important.
At the $ or > prompt within the window of Terminal, cmd or PowerShell, type docker --version to check the version currently installed.
Docker version 19.03.5, build 633a0ea
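If the command is not recognized, Docker is probably not installed or not on the search path. A minimal sketch of a slightly more defensive check for a bash terminal follows; the variable name docker_status and the messages are illustrative only.

```shell
# Check whether the docker program can be found before asking for its version
if command -v docker >/dev/null 2>&1; then
    docker_status="found"
    docker --version
else
    docker_status="missing"
    echo "docker was not found; see the installation instructions"
fi
```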
Before going further, you now need to log in with your Docker Hub ID. You should already have created one before this or the previous workshop. If you need to create an ID now, go to https://hub.docker.com to register.
Docker login: type docker login at the prompt and enter your credentials.
Login with your Docker ID to push and pull images from Docker Hub.
If you don't have a Docker ID, head over to https://hub.docker.com to create one.
Username: YOUR_DOCKER_ID_HERE
Password:
Login Succeeded
$
Note: if you do not log in first you will receive an error message when trying to start Docker in the next steps.
In the previous workshop we learned how to find, choose, and pull (download) a docker image from the Docker hub.
Today we’ll download multiple images and use the programs in each to accomplish tasks that are preliminary to genomic analysis. These images belong to a larger project called Bioinformatics Docker Images Project [2].
This group of docker images provides useful information on how to run the software contained within. However, there are unclear descriptions and errors that we’ll overcome during the workshop.
We will use the following images:
clustalomega (doc) - Sequence alignment
sratoolkit (doc) - Operations on SRA database
We can get the appropriate pull command from the web pages on the Docker hub.
pull images.
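As a sketch, the two images listed above could be pulled with a short loop. The image names come from the list; the loop itself and the error message are illustrative only, and typing one docker pull command per image works just as well.

```shell
# Images named in the list above
images="pegi3s/clustalomega pegi3s/sratoolkit"

# Pull each image in turn and report any failure without stopping
for image in $images; do
    docker pull "$image" || echo "Could not pull $image (is Docker running and are you logged in?)"
done
```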
We can list images with docker images:
REPOSITORY            TAG       IMAGE ID       CREATED        SIZE
pegi3s/sratoolkit     latest    12fb29fcba17   2 months ago   306MB
pegi3s/fastqc         latest    5a439982c750   4 months ago   579MB
pegi3s/clustalomega   latest    ed9da1fc309e   4 months ago   290MB
Note on size: Docker images are constructed in layers that can be common to multiple images. Therefore, the actual disk space needed to store multiple images is less than the sum of the SIZE values reported in the list.
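To see this for yourself, Docker can report the disk space it actually uses and the layers of a given image. This is only a sketch: the numbers will differ on every computer, and the error messages are illustrative.

```shell
# Summary of the disk space actually used by images, containers and volumes
docker system df || echo "docker system df failed; is Docker running?"

# Layers that make up one of the images from the list
image="pegi3s/clustalomega"
docker history "$image" || echo "docker history failed; has $image been pulled?"
```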
From the web clustalomega documentation [3]: Clustal Omega is a multiple sequence alignment program for aligning three or more sequences together in a computationally efficient and accurate manner. It produces biologically meaningful multiple sequence alignments of divergent sequences.
A full description of the algorithms used by Clustal Omega is available in the Molecular Systems Biology paper Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega (Sievers et al., 2011). Latest additions to Clustal Omega are described in Clustal Omega for making accurate alignments of many protein sequences (Sievers and Higgins, 2018).
As a follow-up to our EMBOSS sequence alignment we’ll use this program to make a very small multiple sequence alignment of the short glucagon peptide family that was saved in a shared directory.
Note: If you need to create the shared directory and the sequence files see below in APPENDIX A.
The glucagon family sequence files are in individual fasta format. We first need to create a fasta file that contains them all with a simple cat command:
Combine sequences.
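A minimal, self-contained sketch of this step is shown here. It assumes hypothetical file names (GIP.fasta, GLP-1.fasta, GLP-2.fasta, glucagon.fasta) and works in a scratch directory so that nothing is overwritten; in the workshop the files are in the shared directory and may be named differently (see APPENDIX A.)

```shell
# Scratch directory for the demonstration (in class: the shared directory)
workdir=$(mktemp -d)

# Hypothetical file names, one sequence per file
printf '>GIP\nYAEGTFISDYSIAMDKIRQQDFVNWLLAQ\n' > "$workdir/GIP.fasta"
printf '>GLP-1\nHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG\n' > "$workdir/GLP-1.fasta"
printf '>GLP-2\nHADGSFSDEMNTILDNLAARDFINWLIQTKITD\n' > "$workdir/GLP-2.fasta"
printf '>glucagon\nHSQGTFTSDYSKYLDSRRAQDFVQWLMNT\n' > "$workdir/glucagon.fasta"

# Concatenate the four single-sequence files into one multiple-sequence file
cat "$workdir/GIP.fasta" "$workdir/GLP-1.fasta" \
    "$workdir/GLP-2.fasta" "$workdir/glucagon.fasta" \
    > "$workdir/sequences.fasta"

# Count the header lines: there should be one per sequence
grep -c '^>' "$workdir/sequences.fasta"
```

The files are listed explicitly rather than with a wildcard such as *.fasta, so that the output file cannot accidentally be included in its own input on a second run.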
The final, multiple sequence fasta file contains all the data:
>GIP
YAEGTFISDYSIAMDKIRQQDFVNWLLAQ
>GLP-1
HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG
>GLP-2
HADGSFSDEMNTILDNLAARDFINWLIQTKITD
>glucagon
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
We are now ready to apply what we learned in the previous workshop: we have downloaded (pulled) the docker image and we know how to share a directory… This should be a breeze right?!
We’ll see!
First we need to explore the container to see where we can “attach” (map) the dockershare directory. In a previous example it was /data but we need to see if that directory actually exists in this image.
It seems that it could exist if we trust the info from the docker hub page for pegi3s/clustalomega:
You should adapt and run the following command:
docker run --rm -v /your/data/dir:/data pegi3s/clustalomega -i /data/sequences.fasta -o /data/sequences_aligned.fasta
In this command, you should replace:
/your/data/dir to point to the directory that contains the FASTA file you want to align.
sequences.fasta to the actual name of your FASTA file.
sequences_aligned.fasta to the actual name of your aligned FASTA file.
So we should be good:
/your/data/dir can be $HOME/dockershare
sequences.fasta is the combined file we just created.
sequences_aligned.fasta would be the written output.
Since we would share the /data folder henceforth this should work OK.
We can create a temporary container to request the help provided by the program itself. For this purpose we can add -h or --help. The program will print help information on the screen.
Note: At this point we do not request a shared directory, so the command can be given from any directory on the local computer.
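A minimal sketch of such a request is shown here. The option placed after the image name is handed to the program inside the container; the variable name image and the error message are illustrative only.

```shell
# Temporary container (--rm) that only prints the program help
image="pegi3s/clustalomega"
docker run --rm "$image" --help || echo "docker run failed; see the message above"
```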
For more details, the README file is available online [4].
Below we’ll follow an example based on the sequences we have and inspired by the example on the docker hub page.
We can now run clustalomega from a container. The example given on the web page adds the names of the files on the docker run command itself:
Run command.
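Adapted to our files, the command from the docker hub page would look like the sketch below. It assumes that sequences.fasta is in $HOME/dockershare; the variable name share and the error message are illustrative only.

```shell
# Directory on the host that is shared with the container as /data
share="$HOME/dockershare"

# Align the sequences; the result is written back into the shared directory
docker run --rm \
    -v "$share":/data \
    pegi3s/clustalomega \
    -i /data/sequences.fasta \
    -o /data/sequences_aligned.fasta \
    || echo "docker run failed; see the message above"
```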
After completing the task the container exits and we are back at the host computer prompt.
We can display the alignment file on the screen with cat sequences_aligned.fasta (from within the shared directory):
>GIP
YAEGTFISDYSIAMDKIRQQDFVNWLLAQ----
>GLP-1
HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG--
>GLP-2
HADGSFSDEMNTILDNLAARDFINWLIQTKITD
>glucagon
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT----
The - characters represent the gaps. Therefore it worked.
However, if you re-run the same command (using the same file names) you’ll get this error:
Therefore we learn that adding --force will fix that problem.
Other errors might suggest looking into the help:
The help page is rather long, but we can concentrate on the output file and format information:
We can re-run the command with a different output format that provides a better visual of an alignment. This would be true for the subset clu[stal],msf,phy[lip],selex.
EXERCISE:
Time permitting, you can rerun the commands with one or more of these formats.
Do not forget to add --force to overwrite the output file.
For example (written with the line continuation mark \):
docker run -it --rm \
-v $HOME/dockershare:/data pegi3s/clustalomega \
-i /data/sequences.fasta \
-o /data/sequences_aligned.fasta \
--force \
--outfmt=clu
# check output (run from within $HOME/dockershare)
cat sequences_aligned.fasta
CLUSTAL O(1.2.4) multiple sequence alignment
GIP YAEGTFISDYSIAMDKIRQQDFVNWLLAQ----
GLP-1 HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG--
GLP-2 HADGSFSDEMNTILDNLAARDFINWLIQTKITD
glucagon HSQGTFTSDYSKYLDSRRAQDFVQWLMNT----
:::*:* .: . :: ::*: **:
EXERCISE 2:
Time permitting, create a file (named e.g. sequences2.fasta) for the longer protein sequences as detailed in APPENDIX A under “Protein FASTA for clustalomega” and align it.
Note: There is very little difference between these sequences, and using the clu format for the output will make the results easier to read.
Perhaps this should have been the first thing to do?!
Unless special instructions are given when the docker image is created, it should be possible to launch the container and explore its content as we have done in previous workshops.
The simplest way is to request a shell on the command, e.g. by adding /bin/bash or /bin/sh at the end of the docker run command. For example we ran docker run -it alpine /bin/sh in the first workshop.
In the same way the following should work, omitting sharing a directory as we only want to explore:
docker run -it --rm pegi3s/clustalomega /bin/sh
clustal-omega: unexpected argument "/bin/sh"
For more information try: clustalo --help
The container does not let us in… In this case we were still able to run the clustalo software. But it is sometimes useful, or necessary, to dive inside the container… The section below explores these options.
The above problem poses the question of the usability of a docker file. In the future we’ll be able to create our own docker files and images but for now we have to rely on existing ones.
As we have done in a previous workshop, the best documentation is to explore the docker container to know what is available. In the above example we assume from the documentation that the /data directory exists… Since the command worked we can suppose that it does.
As a quick reminder here is the process to create a container from the beginning, even though for now we only downloaded existing images.
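Write a docker file and create an image from it (docker build.)
Upload the image to the Docker hub (docker push.)
Download the image from the Docker hub with docker pull.
Create and run a container from the image with docker run.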
In order to investigate (and learn useful information at the same time) we have to explore data provided on the Docker hub web page for this image. We will start with the docker file which contains instructions and therefore the blueprint information on the docker image and containers that are derived.
Explore clustalomega docker page.
clustalomega docker page: https://hub.docker.com/r/pegi3s/clustalomega
As of this writing this is what the docker file contains:
FROM ubuntu:18.04
RUN apt-get update \
&& apt-get install -y wget make g++ libargtable2-dev
RUN wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz -O /tmp/clustalomega.tar.gz \
&& tar zxvf /tmp/clustalomega.tar.gz -C /opt/ && rm /tmp/clustalomega.tar.gz \
&& cd /opt/clustal-omega-1.2.4/ \
&& ./configure && make && make install
ENTRYPOINT ["clustalo"]
This docker file is rather “simple” in the sense that there are only a few lines. This is what it means (\ is the line-continuation code to stipulate that this is a single line to execute.)
FROM ubuntu:18.04 means that this “generic” version of Ubuntu is used as the starting point to which we add the libraries and software that are needed in the next lines.
RUN apt-get update updates Ubuntu and adds the build tools and a library.
RUN wget downloads and installs clustalomega.
ENTRYPOINT ["clustalo"] means that when a container is activated, it immediately runs the clustalo program.
The last point is the critical one: as soon as the container is activated, the software clustalo is started and therefore all data provided on the docker run command line is passed on to the clustalo program. It follows that our command docker run -it --rm pegi3s/clustalomega /bin/sh failed simply because the argument /bin/sh was given directly to the clustalo program, which did not recognize it as the name of a file containing sequences, as the program expects.
Therefore the problem is that we cannot, at this point, “enter and explore” the container itself because of the ENTRYPOINT command that is written “in stone” within the docker file and therefore within the docker image and subsequently the docker container.
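The effect of ENTRYPOINT can be pictured with a small shell model. This is only a simplified illustration, not Docker itself: the helper container_command is hypothetical and merely prints the command line that the container would end up executing.

```shell
# Simplified model: the entry point is fixed in the image, and everything
# typed after the image name is appended to it as arguments.
entrypoint="clustalo"

# Hypothetical helper that prints the command the container would execute
container_command() {
    echo "$entrypoint $*"
}

container_command --help      # prints: clustalo --help
container_command /bin/sh     # prints: clustalo /bin/sh
```

In the second case clustalo receives /bin/sh as an argument, which is why the program, and not a shell, answered our request.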