
4.2 Alignment format
Amongst the many sequence and multiple sequence formats available10 for EMBOSS, one of the simplest interleaved formats is the “clustal” option.
The EMBOSS program used to manipulate the format of sequence files is called seqret. (The list of EMBOSS “apps” is available online11.)
The format can be specified simply by adding the format name before the sequence file itself (as a “prefix”), with the two separated by a double colon (::).
For example, assuming that you have EMBOSS installed locally, the format change from multiple FastA format to the clustal format would be written as:
seqret fasta::spike_filtered_omega.fa clustal::spike_filtered_omega.clustal
To perform this task with a Docker container (shown with the command continuation symbol \ for clarity):
docker run -it --rm \
-v $(pwd):/data -w /data \
pegi3s/emboss \
seqret fasta::spike_filtered_omega.fa \
clustal::spike_filtered_omega.clustal
Upon completion, the user is returned to the local shell and the container is discarded. (Windows users can refer to section 4.1 above for the specific Docker for Windows command format.)
Docker Magic
The first 3 lines of the docker run command above create a new container. The seqret command and subsequent lines can simply be replaced by the other EMBOSS commands that are shown below.
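Since only the last lines change from one command to the next, the container part can be wrapped in a small shell function. The following is a minimal sketch rather than anything provided by EMBOSS or the pegi3s image: the function name emboss_docker is a hypothetical choice, and the body simply repeats the first 3 lines of the docker run command shown above.

```shell
# Hypothetical helper: run any EMBOSS command inside the pegi3s/emboss container.
# The docker run options mirror the first 3 lines of the command shown above.
emboss_docker() {
  docker run -it --rm \
    -v "$(pwd)":/data -w /data \
    pegi3s/emboss \
    "$@"
}

# Example call (requires Docker and the input file in the current directory):
# emboss_docker seqret fasta::spike_filtered_omega.fa clustal::spike_filtered_omega.clustal
```

Whatever follows the function name is passed unchanged to the container, so the EMBOSS command is written exactly as it would be for a local installation.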
Other interleaved sequence format options could be used in the same way, for example msf, nexus, phylip, etc.
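To try several of these formats in one go, the same seqret call can be repeated in a loop. This is only a sketch under a few assumptions: seqret is installed locally and on the PATH, the function name convert_formats is hypothetical, and naming each output file after its format (for example spike_filtered_omega.msf) is an arbitrary choice made here.

```shell
# Sketch: convert one multiple FastA file into several interleaved formats.
# Assumes seqret is on the PATH; function name and output naming are hypothetical.
convert_formats() {
  input="$1"          # input file, e.g. spike_filtered_omega.fa
  shift               # remaining arguments are EMBOSS format names
  base="${input%.*}"  # file name without its extension
  for fmt in "$@"; do
    # Same "format::file" prefix syntax as before, one output file per format
    seqret "fasta::${input}" "${fmt}::${base}.${fmt}"
  done
}

# Example call:
# convert_formats spike_filtered_omega.fa msf nexus phylip
```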
However, once again, upon inspection of the interleaved sequence file, it is still difficult to spot whether the sequences differ at all, and where any differences are located.
The format mega is useful in this case, as only the top sequence is shown in full, while only the differences are displayed for the remaining sequences.
Here is a command that skips the top of the header and shows the first 5 lines of sequences:
#QIU81885.1 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS
#QIU80913.1 .................................................L
#QIU81585.1 ..................................................
#QIU80973.1 ..........................V.......................
#QIS61422.1 ..................................................
This is indeed a useful format visually. In the next section we’ll discover that we can also add a consensus sequence and count the number of amino acid changes.
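A listing like the one shown above can be pulled out of mega output with standard shell tools. The sketch below is one possible way to do it, not necessarily the command used for this section. It rests on two assumptions: sequence rows start with the # character while the only other such line is the #mega header, and seqret accepts stdout as the output file name. The function name mega_rows is hypothetical.

```shell
# Sketch: print the first N sequence rows of MEGA-formatted text read from stdin.
# Assumption: sequence rows start with '#', and the only other '#' line is '#mega'.
mega_rows() {
  n="${1:-5}"   # number of rows to show, 5 by default
  grep '^#' | grep -iv '^#mega' | head -n "$n"
}

# Example call with a local EMBOSS installation
# (assumes seqret writes to standard output when the file name is stdout):
# seqret fasta::spike_filtered_omega.fa mega::stdout | mega_rows 5
```

Selecting lines by their leading # avoids having to know how many header lines precede the sequences.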