4.2 Alignment format

Among the many sequence and multiple-sequence formats available10 for EMBOSS, one of the simplest interleaved formats is the “clustal” option.

The EMBOSS program used to manipulate the format of sequence files is called seqret. (The list of EMBOSS “apps” is available online11.)

The format can be specified simply by adding the format name before the sequence file name (as a “prefix”), with the two separated by a double colon (::).

For example, assuming that you have EMBOSS installed locally, the format change from multiple FastA format to the clustal format would be written as:

seqret \
fasta::spike_filtered_omega.fa clustal::spike_filtered_omega.clustal  
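The naming pattern used in this section (same base name, with the format name as the new extension) can be sketched as a small shell helper. The function name out_arg is hypothetical and is not part of EMBOSS; it only builds the format::file string and does not run any conversion.

```shell
# Hypothetical helper (not an EMBOSS command): build the "format::file"
# output argument from an input file name and a target format name.
out_arg() {
    infile=$1
    fmt=$2
    base=${infile%.*}    # drop the last extension, e.g. ".fa"
    printf '%s::%s.%s\n' "$fmt" "$base" "$fmt"
}

out_arg spike_filtered_omega.fa clustal
# prints: clustal::spike_filtered_omega.clustal
```

With EMBOSS installed locally, the result could then be passed on, for example: seqret fasta::spike_filtered_omega.fa "$(out_arg spike_filtered_omega.fa clustal)"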

To perform the same task with a Docker container, run the following (shown with the command continuation symbol \ for clarity):

docker run -it --rm                      \
-v $(pwd):/data  -w /data                \
pegi3s/emboss                            \
seqret fasta::spike_filtered_omega.fa    \
clustal::spike_filtered_omega.clustal  

Upon completion, the user is returned to the local shell and the container is discarded. (Windows users can refer to section 4.1 above for the specific Docker for Windows command format.)

Docker Magic

The first three lines of the docker run command above create a new container from the pegi3s/emboss image. The seqret command on the lines that follow can simply be swapped for any of the other EMBOSS commands shown below.

Other interleaved sequence formats can be used in the same way, for example msf, nexus, or phylip.
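Since the container part of the docker run command stays the same, one possible convenience is to wrap it in a shell function and pass the EMBOSS command as arguments. This is only a sketch: the function name emboss and the DRY_RUN variable are assumptions introduced here for illustration. When DRY_RUN is set, the command is printed instead of executed, which makes it easy to check before running it.

```shell
# Hypothetical wrapper: the fixed "docker run" part from above, followed by
# whatever EMBOSS command is given as arguments.
# If DRY_RUN is set to a non-empty value, the command is only printed.
emboss() {
    ${DRY_RUN:+echo} docker run -it --rm \
        -v "$(pwd):/data" -w /data \
        pegi3s/emboss "$@"
}

# Dry run: print one seqret command per interleaved format.
DRY_RUN=1
for fmt in msf nexus phylip; do
    emboss seqret fasta::spike_filtered_omega.fa "${fmt}::spike_filtered_omega.${fmt}"
done
```

Running unset DRY_RUN afterwards and repeating the loop would execute the commands in the container rather than print them.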

However, on inspecting the interleaved sequence file, it is again difficult to tell whether the sequences differ at all and, if so, where the differences are located.

The mega format is useful in this case: only the top sequence is shown in full, while for the remaining sequences only the differences are displayed and identical positions appear as dots.

seqret fasta::spike_filtered_omega.fa mega::spike_filtered_omega.mega

Here is a command that skips the header at the top of the file and shows the first 5 lines of sequence (lines 11 to 15 of the file):

head -n 15 spike_filtered_omega.mega | tail -n 5
#QIU81885.1     MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS
#QIU80913.1     .................................................L
#QIU81585.1     ..................................................
#QIU80973.1     ..........................V.......................
#QIS61422.1     ..................................................
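As a rough check on an excerpt like the one above, the characters that are not dots can be counted for each sequence after the first. This is a minimal sketch using awk, not an EMBOSS feature: the function name count_diffs is hypothetical, it assumes the sequence is the second whitespace-separated field on each line, and it would also count any gap characters as differences.

```shell
# Hypothetical helper: read mega-style lines on standard input, treat the
# first line as the reference, and for every other line print the name and
# the number of characters in the second field that are not dots.
count_diffs() {
    awk 'NR > 1 { seq = $2; n = gsub(/[^.]/, "", seq); print $1, n }'
}

# The five lines shown above, fed in through a here-document.
count_diffs <<'EOF'
#QIU81885.1     MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS
#QIU80913.1     .................................................L
#QIU81585.1     ..................................................
#QIU80973.1     ..........................V.......................
#QIS61422.1     ..................................................
EOF
```

For this excerpt, the sketch reports 1 for QIU80913.1 and QIU80973.1 and 0 for the other two sequences. The same function could be placed at the end of the head and tail pipeline above.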

This is indeed a useful format visually. In the next section we’ll discover that we can also add a consensus sequence and count the number of amino acid changes.