
5.1 Convert to number of differences matrix
We could also have calculated the values as a percentage by adding --percent-id. However, instead of 0.001571 we would have, for example, 99.842891, a value close to 100 as there are very few differences indeed.
In both cases we can make use of the matrix by multiplying these numbers by the sequence length, since all sequences are of length 1273.
With the appropriate “rounding” to the nearest integer we would get the following for the default and percent versions of the matrix:
0.001571 * 1273 = 1.999883, which rounds up to 2
(99.842891/100) * 1273 = 1271 (after rounding), and 1273 - 1271 = 2 as well
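The arithmetic above can be checked directly in the shell. This is a minimal sketch: the variable names dist, pid and len are only illustrative, and the %.0f format is used here to round to the nearest integer.

```shell
# Values quoted in the text for one pair of sequences
dist=0.001571      # default output (distance)
pid=99.842891      # output with --percent-id
len=1273           # sequence length

# Default version: distance * length, rounded to the nearest integer
diff_from_dist=$(awk -v d="$dist" -v n="$len" 'BEGIN { printf "%.0f", d * n }')

# Percent version: length minus the number of identical positions
diff_from_pid=$(awk -v p="$pid" -v n="$len" 'BEGIN { printf "%.0f", n - (p / 100) * n }')

echo "$diff_from_dist $diff_from_pid"
```

Both calculations should agree on the same number of differing amino acids.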
In other words, if we multiply each number within the matrix by the sequence length we’ll obtain the number of differing amino acids for each pairwise sequence comparison.
The following is an advanced command that I modified based on the answer to a similar question on an Internet forum [12]. The purpose of the command is to apply the multiplication shown above to all numbers within the matrix.
awk '{ for ( i=2; i<=NF; i++ ) printf int($i*1273) " " } { print $1," " }' \
spike_filtered_omega.dist > spike_diff.dist
Briefly, it is a for loop within an awk script:
i=2: start with the second column, since the first column contains the sequence names.
i<=NF: continue as long as i is less than or equal to NF (the number of fields, or columns).
i++: increment i by a value of 1 at each round.
printf: prints formatted output, without adding a new line.
int: is the awk function used to “round” numbers; strictly speaking it truncates the decimal part rather than rounding to the nearest integer.
$i: represents the value in column i of the current line. All will be multiplied by 1273.
" ": is part of the printf formatting to add a blank space after each number. There is one blank space between the quotes.
{ print $1," " }: runs once the loop has written the line of calculated numbers. $1 adds column 1, with the sequence names of the original matrix, and print then ends the line. As a result, the names end up at the end of each line.
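Because int truncates, a product such as 1.999883 is written as 1 rather than 2. The sketch below compares the command above with a possible variant that rounds with the %.0f format and keeps the sequence name at the start of the line. The input is a tiny made-up stand-in for spike_filtered_omega.dist, and the names seqA and seqB are hypothetical.

```shell
# Tiny stand-in for spike_filtered_omega.dist (hypothetical values)
tmp=$(mktemp)
printf '%s\n' '2' 'seqA 0.000000 0.001571' 'seqB 0.001571 0.000000' > "$tmp"

# Truncation, as in the command above: 1.999883 becomes 1
truncated=$(awk '{ for ( i=2; i<=NF; i++ ) printf int($i*1273) " " } { print $1," " }' "$tmp")

# Rounding to the nearest integer with %.0f: 1.999883 becomes 2.
# The name is printed first; the single-field count line is copied unchanged.
rounded=$(awk 'NF == 1 { print }
               NF > 1  { printf "%s", $1
                         for (i = 2; i <= NF; i++) printf " %.0f", $i * 1273
                         print "" }' "$tmp")

echo "$truncated"
echo "$rounded"
rm -f "$tmp"
```

Which behaviour is preferable depends on how the distances were computed. The matrices shown in this section come from the original command, so they use truncation.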
The output is a set of small one-digit numbers representing the number of amino acid differences between each sequence pair.
A diagonal of 0 values indicates the comparison of each sequence with itself.
Below is a complete matrix output when the results contained only 32 sequences.
At a glance we could immediately conclude that the sequence that is most different from all others is QHS34546.1, located on the second-to-last line, with mostly 5 and up to 6 differences from the other sequences.
32
0 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU81885.1
1 0 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU80913.1
1 1 0 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU81585.1
1 1 1 0 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIU80973.1
1 1 1 1 0 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS61422.1
1 1 1 1 1 0 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS61338.1
1 1 1 1 1 1 0 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS61254.1
1 1 1 1 1 1 1 0 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60930.1
1 1 1 1 1 1 1 1 0 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60978.1
1 1 1 1 1 1 1 1 1 0 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 QIS60906.1
1 1 1 1 1 1 1 1 1 1 0 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60546.1
1 1 1 1 1 1 1 1 1 1 1 0 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60489.1
1 1 1 1 1 1 1 1 1 1 1 1 0 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS60582.1
3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 1 1 3 1 3 3 3 3 3 3 3 3 3 3 3 6 3 QIS30615.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 0 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 QIS30425.1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIK50427.1
3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 0 3 1 3 3 3 3 3 3 3 3 3 3 3 6 3 QIS30295.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 0 1 1 3 1 1 1 1 1 1 1 1 1 5 1 QIS30335.1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 4 1 YP_009724390.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 0 3 1 1 1 1 1 1 1 1 1 5 1 QIS30165.1
3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 0 3 3 3 3 3 3 3 3 3 6 3 QIQ49882.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 0 1 1 1 1 1 1 1 1 5 1 QIO04367.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 0 1 1 1 1 1 1 1 5 1 QIC53204.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 0 1 1 1 1 1 1 5 1 QII87830.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 0 1 1 1 1 1 5 1 QHU79173.2
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 0 1 1 1 1 5 1 QIA98583.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 0 1 1 1 5 1 QIA20044.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 0 1 1 5 1 QIJ96493.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 0 1 5 1 QII57278.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 0 5 1 QHR84449.1
5 5 5 5 5 5 5 5 5 5 5 5 5 6 5 5 6 5 4 5 6 5 5 5 5 5 5 5 5 5 0 5 QHS34546.1
1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 5 0 QHZ00379.1
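The at-a-glance conclusion about the most different sequence can also be checked by adding up each row. The following is only a sketch under assumptions: it uses a made-up three-sequence matrix in the same layout as spike_diff.dist (numbers first, name last), with hypothetical names seqA to seqC.

```shell
# Small stand-in for spike_diff.dist (hypothetical 3-sequence matrix)
tmp=$(mktemp)
printf '%s\n' '3' '0 1 5 seqA' '1 0 5 seqB' '5 5 0 seqC' > "$tmp"

# Sum fields 1..NF-1 on each row; the last field is the sequence name.
# The count line has a single field and is skipped by the NF > 1 test.
most_different=$(awk 'NF > 1 { s = 0; for (i = 1; i < NF; i++) s += $i; print s, $NF }' "$tmp" \
    | sort -k1,1nr | head -n 1)

echo "$most_different"
rm -f "$tmp"
```

The row with the largest total is printed first, here the total followed by the name of the sequence.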
At the last revision the number of sequences retained had increased to 167.
A quick visual inspection of the updated file is easily accomplished with the command less -S, where the -S option avoids soft wrapping of the long lines.
This quick glance shows that sequence QHS34546.1 appears to still be the most different. It seems to also contain the highest number of differences (currently 9, stars added), roughly in the middle of the matrix results for this sequence:
6 8 6 6 4 7 7 6 6 6 5 6 6 6 6 6 6 6 6 6 6 5 5 5 6 6 5 6 6 6 6 5 6 6 6 5 6 6 6 6 7 5 6 5 6 6 6 5 6 5 5 6 5 4 5 6 6 6 6 7 6 6 5 6 6 5 5 5 5 5 6 5 7 5 5 6 6 6 6 7 7 6 5 6 6 5 6 6 6 6 6 5 6 7 6 *9* 5 5 5 5 7 6 5 8 6 5 6 1 6 5 5 6 5 5 6 5 6 5 5 5 6 6 5 5 6 7 6 6 6 5 5 6 6 6 6 5 5 5 6 6 5 5 5 5 6 5 7 5 6 5 5 5 5 6 6 5 5 6 5 0 5 5 5 5 5 5 5 QHS34546.1
The sentence above is written with the words appears and seems in order to specify that this statement is based on (human) casual visual inspection.
But can we “automatically” find the highest number of differences without visual inspection and without writing a complicated program?