Chapter 5 Distance matrix

As the alignment showed, the sequences are very similar to each other.

But how many amino acids are different between the various sequences?

Another question we could ask is “what is the largest number of differences amongst all the sequences?”

Calculating a “distance matrix” can help answer both questions, and clustalo can calculate such a matrix while performing the alignment.

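The --distmat-out option names the file that will receive the matrix, and --full asks clustalo to compute the full pairwise distance matrix rather than its default mBed approximation. The --force option is only needed when the command is run more than once (e.g. when testing), as it allows existing output files to be overwritten.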

clustalo -i spike_filtered.fa -o spike_filtered_omega.fa -v  \
--distmat-out=spike_filtered_omega.dist \
--full  --force  

Or, if using Docker (Windows users can refer to section 4.1 for the Windows-specific command format):

docker run -it --rm -v "$(pwd)":/data -w /data \
pegi3s/clustalomega -i spike_filtered.fa -o spike_filtered_omega.fa -v \
--distmat-out=spike_filtered_omega.dist \
--full --force  

We can view the matrix text file with the following command; the -S option prevents “soft wrapping” of long lines:

less -S spike_filtered_omega.dist

To explore the format, we print only the first 4 lines, truncated to the first 60 characters of each line:

cut -c 1-60  < spike_filtered_omega.dist | head -4
32
QIU81885.1     0.000000 0.001571 0.001571 0.001571 0.001571 
QIU80913.1     0.001571 0.000000 0.001571 0.001571 0.001571 
QIU81585.1     0.001571 0.001571 0.000000 0.001571 0.001571 

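The first line contains only the number of sequences, 32 in this case. (The current update has 167 sequences.) Each following line starts with a sequence identifier, followed by the distances from that sequence to every sequence in the file; the distance of a sequence to itself is 0.000000.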
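As a quick sanity check of this layout, the count on the first line can be compared with the number of rows that follow. The sketch below is only an illustration: the function name check_distmat and the file toy.dist (with made-up names and values) are assumptions, not part of clustalo.

```shell
# Toy matrix in the same layout (made-up names and values, for illustration only).
cat > toy.dist <<'EOF'
3
seqA     0.000000 0.001571 0.003142
seqB     0.001571 0.000000 0.001571
seqC     0.003142 0.001571 0.000000
EOF

# Hypothetical helper: checks that the count on line 1 matches the number
# of rows, and that every row holds one label plus that many distances.
check_distmat() {
  awk '
    NR == 1 { n = $1 + 0; next }            # line 1: declared number of sequences
    NF > 0  { rows++                        # each later line: label + n distances
              if (NF != n + 1) bad++ }
    END     { if (rows == n && bad == 0) {
                print "OK: " n " sequences"
              } else {
                print "MISMATCH: header " n ", rows " (rows + 0) ", malformed rows " (bad + 0)
                exit 1
              } }
  ' "$1"
}

check_distmat toy.dist
# prints: OK: 3 sequences
```

The same function could then be pointed at the real file with check_distmat spike_filtered_omega.dist.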

On their own, however, these numbers are not very informative.
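One way to get something more informative out of the matrix is to summarise it, for instance by finding the most distant pair of sequences. The sketch below assumes the matrix is symmetric, so it only reads the values below the diagonal; the function name max_distance and the file toy.dist (made-up names and values) are illustrative assumptions.

```shell
# Toy matrix in the same layout (made-up names and values, for illustration only).
cat > toy.dist <<'EOF'
3
seqA     0.000000 0.001571 0.003142
seqB     0.001571 0.000000 0.001571
seqC     0.003142 0.001571 0.000000
EOF

# Hypothetical helper: prints the largest distance and the pair it belongs to.
max_distance() {
  awk '
    NR == 1 { next }                        # skip the sequence count
    NF > 0  { row++
              name[row] = $1                # remember the label of this row
              for (col = 1; col < row; col++) {      # values below the diagonal
                d = $(col + 1) + 0          # field 1 is the label, so shift by one
                if (d > max) { max = d; a = name[col]; b = name[row] }
              } }
    END     { printf "%.6f %s %s\n", max, a, b }
  ' "$1"
}

max_distance toy.dist
# prints: 0.003142 seqA seqC
```

On the real data this would be run as max_distance spike_filtered_omega.dist. If every distance is zero, the two names are left empty.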