5.1 Convert to number of differences matrix

We could also have calculated the values as percent identity by adding --percent-id. Instead of 0.001571 we would then have, for example, 99.842891, a value close to 100 since there are very few differences.

In both cases we can make use of the matrix by multiplying these numbers by the sequence length since all sequences are of length 1273.

With the appropriate rounding to the nearest integer we would get the following for the default and percent versions of the matrix:

  • 0.001571 * 1273 = 1.999883, which rounds to 2
  • (99.842891/100) * 1273 ≈ 1271, and 1273 - 1271 = 2 as well
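The two conversions above can be checked directly from the command line. This is only a minimal sketch: the variable names (len, diff_from_dist, diff_from_pid) are made up for illustration, and the printf format "%.0f" is used because it rounds to the nearest integer.

```shell
# Sequence length shared by all sequences (value taken from the text above).
len=1273

# Default matrix value: fraction of differing positions.
diff_from_dist=$(awk -v d=0.001571 -v len="$len" \
  'BEGIN { printf "%.0f\n", d * len }')

# --percent-id value: percentage of identical positions,
# so the differences are what remains once identities are subtracted.
diff_from_pid=$(awk -v p=99.842891 -v len="$len" \
  'BEGIN { printf "%.0f\n", len - (p / 100) * len }')

echo "$diff_from_dist $diff_from_pid"
```

Both values should come out as 2, in agreement with the hand calculation.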

In other words, if we multiply each number within the default matrix by 1273, we obtain the number of differing amino acids for each pairwise sequence comparison.

The following is an advanced command that I modified from the answer to a similar question on an Internet forum [12]. Its purpose is to apply the multiplication shown above to every number within the matrix.

awk '{ for ( i=2; i<=NF; i++ ) printf int($i*1273) " " } { print $1," " }' \
spike_filtered_omega.dist > spike_diff.dist

Briefly, it is a for loop within an awk script:

  • i=2: start with the second column, since the first column contains the sequence names
  • i<=NF: continue as long as i is less than or equal to NF (the number of fields, or columns, on the line)
  • i++: increment i by 1 at each round
  • printf: prints formatted output without adding a newline
  • int: the awk function that truncates a number to its integer part; it does not round to the nearest integer, so 1.999883 becomes 1
  • $i: the value in column i of the current line; each one is multiplied by 1273
  • " ": appended after each number so that the values are separated by a blank; there is one blank space between the quotes
  • { print $1," " }: runs once per line, after the loop has finished
    • $1 prints column 1 of the original matrix (the sequence name), which therefore ends up at the end of the line
    • ," " adds trailing blanks after the name, and print then ends the line with a newline
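Since awk's int drops the fractional part rather than rounding, a value such as 1.999883 is written as 1 by the command above. The sketch below compares the two behaviours on a tiny made-up matrix. The file contents, the sequence names seqA and seqB, and the function names convert_rounded and convert_truncated are all hypothetical; only the layout mimics spike_filtered_omega.dist.

```shell
# Hypothetical two-sequence matrix: count on line 1, then name + distances.
toy=$(mktemp)
printf '2\nseqA 0.000000 0.001571\nseqB 0.001571 0.000000\n' > "$toy"

# Variant: "%.0f" rounds each product to the nearest integer.
convert_rounded() {
  awk -v len=1273 \
    '{ for (i = 2; i <= NF; i++) printf "%.0f ", $i * len } { print $1 }' "$1"
}

# Same logic as the original command: int() truncates the product.
convert_truncated() {
  awk -v len=1273 \
    '{ for (i = 2; i <= NF; i++) printf "%d ", int($i * len) } { print $1 }' "$1"
}

convert_rounded "$toy"
convert_truncated "$toy"
```

On this toy input the rounded variant reports 2 differences between seqA and seqB while the truncated one reports 1. Whether this matters for the real data depends on how close the products are to whole numbers.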

The output is a set of small one-digit numbers representing the number of amino acid differences between each sequence pair.

The diagonal of 0 values corresponds to the comparison of each sequence with itself.
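The diagonal can also be verified without reading the matrix by eye. The following is a minimal sketch, assuming the layout produced above (sequence count on the first line, then one row per sequence with the name in the last column). The function name check_diagonal and the toy file are made up for illustration.

```shell
# Print every row whose diagonal entry is not 0; no output means all are 0.
# Row k of the matrix sits on line k+1 (line 1 holds the sequence count),
# so its diagonal entry is field number NR-1.
check_diagonal() {
  awk 'NR > 1 && $(NR - 1) != 0 { print "line " NR ": " $NF }' "$1"
}

# Hypothetical 3x3 matrix in the same layout as spike_diff.dist.
toy_diag=$(mktemp)
printf '3\n0 1 5 seqA\n1 0 5 seqB\n5 5 0 seqC\n' > "$toy_diag"
check_diagonal "$toy_diag"
```

The same function could be pointed at spike_diff.dist; an empty result would confirm that every sequence has 0 differences with itself.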

Below is a complete matrix output when the results contained only 32 sequences.

At a glance we can conclude that the sequence most different from all others is QHS34546.1, located on the second-to-last line, with 4 to 6 differences from every other sequence.

# Full matrix when there were only 32 sequences

cat spike_diff.dist
32  
0 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU81885.1  
1 0 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU80913.1  
1 1 0 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU81585.1  
1 1 1 0 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU80973.1  
1 1 1 1 0 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS61422.1  
1 1 1 1 1 0 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS61338.1  
1 1 1 1 1 1 0 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS61254.1  
1 1 1 1 1 1 1 0 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60930.1  
1 1 1 1 1 1 1 1 0 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60978.1  
1 1 1 1 1 1 1 1 1 0 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 QIS60906.1  
1 1 1 1 1 1 1 1 1 1 0 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60546.1  
1 1 1 1 1 1 1 1 1 1 1 0 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60489.1  
1 1 1 1 1 1 1 1 1 1 1 1 0 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60582.1  
3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 1 1 3 1 3 3 3 3 3 3 3 3 3 3 3 6 3 QIS30615.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 0 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 QIS30425.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIK50427.1  
3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 0 3 1 3 3 3 3 3 3 3 3 3 3 3 6 3 QIS30295.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 0 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS30335.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 4 1 YP_009724390.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 0 3 1 1 1 1 1 1 1 1 1 5 1 QIS30165.1  
3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 0 3 3 3 3 3 3 3 3 3 6 3 QIQ49882.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 0 1 1 1 1 1 1 1 1 5 1 QIO04367.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 0 1 1 1 1 1 1 1 5 1 QIC53204.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 0 1 1 1 1 1 1 5 1 QII87830.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 0 1 1 1 1 1 5 1 QHU79173.2  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 0 1 1 1 1 5 1 QIA98583.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 0 1 1 1 5 1 QIA20044.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 0 1 1 5 1 QIJ96493.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 0 1 5 1 QII57278.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 0 5 1 QHR84449.1  
5 5 5 5 5 5 5 5 5 5 5 5 5 6 5 5 6 5 4 5 6 5 5 5 5 5 5 5 5 5 0 5 QHS34546.1  
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 0 QHZ00379.1  

At the last revision, the number of sequences retained had increased to 167.

A quick visual inspection of the updated file is easily accomplished with the command less -S to avoid soft wrapping:

less -S spike_diff.dist

This quick glance shows that sequence QHS34546.1 still appears to be the most different. It also seems to contain the highest difference value (currently 9, marked here with stars), roughly in the middle of the matrix row for this sequence:

6 8 6 6 4 7 7 6 6 6 5 6 6 6 6 6 6 6 6 6 6 5 5 5 6 6 5 6 6 6 6 5 6 6 6 5 6 6 6 6 7 5 6 5 6 6 6 5 6 5 5 6 5 4 5 6 6 6 6 7 6 6 5 6 6 5 5 5 5 5 6 5 7 5 5 6 6 6 6 7 7 6 5 6 6 5 6 6 6 6 6 5 6 7 6 *9* 5 5 5 5 7 6 5 8 6 5 6 1 6 5 5 6 5 5 6 5 6 5 5 5 6 6 5 5 6 7 6 6 6 5 5 6 6 6 6 5 5 5 6 6 5 5 5 5 6 5 7 5 6 5 5 5 5 6 6 5 5 6 5 0 5 5 5 5 5 5 5 QHS34546.1

The sentence above is written with the words "appears" and "seems" to make clear that this statement is based on casual (human) visual inspection.

But can we “automatically” find the highest number of differences without visual inspection and without writing a complicated program?