5.2 Find largest value in matrix
The result file spike_diff.dist can be considered a matrix of numbers if we exclude the first line, which holds a single number (the number of sequences), and the last column, which contains the sequence names. In computer programming the fancy name would be “array.”
A simple search with the terms find largest number in array on a popular search engine returns about 202,000,000 results, so many pages with solutions in various programming languages are available: Python, C, C++, JavaScript, Java, Swift, Ruby and many more.
However, to answer the question “what is the largest number in the array” (or matrix) we can, once more, use a pipeline of simple Unix tools.
Algorithm:
Here is a solution that does not require any programming and calls on very simple, almost “ordinary” Unix tools:
- remove the first line that contains the number of sequences
- remove the last column (or its content) that lists the sequence names
- convert the blank spaces between the matrix numbers to newline (return) characters
- sort the resulting single column of numbers numerically, keeping only unique values, and reverse the output order so that the largest is at the top of the results.
And now let’s do it! The following two commands achieve the same goal:
# version 1
sed 1d spike_diff.dist | tr ' ' '\n' | fgrep -v . | sort -u -n -r
# version 2
sed 1d spike_diff.dist | tr ' ' '\n' | grep -v '[A-Z]' | sort -u -n -r
- sed 1d spike_diff.dist: All versions start by deleting the first line (1d) of the file spike_diff.dist with the stream editor sed and send the remaining data down the data stream (pipeline).
- tr ' ' '\n': All versions then convert every blank into a newline (return) character. This turns the matrix into a single column of values that can easily be sorted.
- fgrep -v . (version 1): Here we take advantage of the fact that the matrix contains only integers and that sequence names always end with a period followed by a number. The pattern “.” therefore finds only sequence names, and the qualifier -v inverts the match so that only the lines that do not contain the pattern are retained. The command fgrep is a special version of grep that treats the pattern as a literal string. With plain grep the command should be written as grep -v '\.' instead. The “\.” notation “escapes” the dot with the help of a backslash \ so that it means an “actual dot” rather than “any character,” since grep uses “regular expressions” to encode patterns. The single quotes keep the shell from removing the backslash before grep sees it. The caveat with this method is that it would not remove a sequence name that does not contain a dot.
- grep -v '[A-Z]' (version 2): Here we use the “regular expression” pattern recognition of grep and remove all matching lines with the qualifier -v. The pattern [A-Z] represents any uppercase letter, and quoting it keeps the shell from treating the brackets as a file-name wildcard. To make the case more general and include lowercase letters we could add one more piped command, grep -v '[a-z]', or combine both with yet another version of grep: egrep -v "[A-Z]|[a-z]".
- sort -u -n -r: sort is a sorting utility to which we add flags to keep only unique items with -u, to sort on numerical rather than alphabetical value with -n, and to reverse the order with -r so that the largest number is listed at the top.
- sort -n: The end of version 3 sorts numerically without reversing the order, so the largest number ends up on the last line.
The result of any of these commands is currently:
9
8
7
6
5
4
3
1
0
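The caveats of the two filters can be checked on a small made-up file. The following is only a sketch: the file content and the sequence names (in particular isolate_b, a name with neither a dot nor an uppercase letter) are hypothetical and merely mimic the layout of spike_diff.dist. It uses grep -F, the modern spelling of fgrep, and sets LC_ALL=C as an assumption so that [A-Z] matches uppercase ASCII letters only.

```shell
# Assumption: force the C locale so that [A-Z] means uppercase ASCII only.
export LC_ALL=C

# Hypothetical sample in the same layout as spike_diff.dist.
sample=$(mktemp)
cat > "$sample" <<'EOF'
3
0 4 9 MN908947.3
4 0 7 MT012098.1
9 7 0 isolate_b
EOF

# Version 1 filter: drops names containing a dot, then counts leftover names.
kept_v1=$(sed 1d "$sample" | tr ' ' '\n' | grep -F -v . | grep -c '[A-Za-z]')

# Version 2 filter: drops names containing an uppercase letter.
kept_v2=$(sed 1d "$sample" | tr ' ' '\n' | grep -v '[A-Z]' | grep -c '[A-Za-z]')

# A single bracket expression covering both cases of letters.
kept_both=$(sed 1d "$sample" | tr ' ' '\n' | grep -v '[A-Za-z]' | grep -c '[A-Za-z]')

echo "names left: v1=$kept_v1 v2=$kept_v2 both=$kept_both"
rm -f "$sample"
```

On this sample both original filters let isolate_b through, while the combined pattern [A-Za-z] removes every name. It is equivalent to the egrep alternative with "[A-Z]|[a-z]".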
To retrieve only the top value we could add head -1 at the end of the pipeline.
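As a minimal sketch, the head step can be tried on a small made-up file. The file content and sequence names below are hypothetical and only mimic the layout of spike_diff.dist. The sketch uses head -n 1, the POSIX spelling of head -1.

```shell
# Assumption: force the C locale so that [A-Z] means uppercase ASCII only.
export LC_ALL=C

# Hypothetical sample in the same layout as spike_diff.dist.
sample=$(mktemp)
cat > "$sample" <<'EOF'
3
0 4 9 MN908947.3
4 0 7 MT012098.1
9 7 0 MW123456.1
EOF

# Version 2 with head appended: keep only the first (largest) value.
largest=$(sed 1d "$sample" | tr ' ' '\n' | grep -v '[A-Z]' | sort -u -n -r | head -n 1)

echo "$largest"
rm -f "$sample"
```

For this sample the pipeline prints 9, the largest entry of the made-up matrix.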
Even simpler: we don’t in fact need to remove the sequence names if we sort numerically. Version 3 uses the tail command to print only the last line, which holds the largest number.
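The command line for version 3 is not reproduced above, so the following is only a plausible reconstruction based on the description, not necessarily the original command. It keeps sed 1d so that the sequence count on the first line cannot be mistaken for a matrix value, and it runs on a hypothetical sample file with made-up sequence names.

```shell
# Hypothetical sample in the same layout as spike_diff.dist.
sample=$(mktemp)
cat > "$sample" <<'EOF'
3
0 4 9 MN908947.3
4 0 7 MT012098.1
9 7 0 MW123456.1
EOF

# Assumed version 3: no name filter. With sort -n, strings that do not
# start with a number count as 0, so the names sort among the zeros and
# the largest matrix value ends up on the last line.
largest=$(sed 1d "$sample" | tr ' ' '\n' | sort -n | tail -n 1)

echo "$largest"
rm -f "$sample"
```

This shortcut relies on two assumptions: no sequence name starts with a digit, and the matrix contains at least one value greater than zero. If either fails, the last line could be a name rather than a number.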