5.2 Find largest value in matrix

The result file spike_diff.dist can be treated as a matrix of numbers if we set aside the first line, which holds only the number of sequences, and the last column, which contains the sequence names. In computer programming the usual name for such a structure is an “array.”

A simple search with the terms find largest number in array on a popular search engine returns about 202,000,000 results, so many pages with solutions in various programming languages are available: Python, C, C++, JavaScript, Java, Swift, Ruby and many more.

However, to answer the question “what is the largest number in the array” (or matrix) we can, once more, use a pipeline of simple Unix tools.

Algorithm:

Here is a solution that does not require any programming and calls on very simple, almost “ordinary” Unix tools:

  • remove the first line that contains the number of sequences
  • remove the last column (or its content) that lists the sequence names
  • convert the blank spaces between the matrix numbers to return (newline) characters
  • sort the resulting single column of numbers numerically, keeping only unique values, and reverse the output order so that the largest is at the top of the results.

And now let’s do it! The following two commands achieve the same goal:

# version 1
sed 1d spike_diff.dist | tr ' ' '\n' | fgrep -v . | sort -u -n -r

# version 2
sed 1d spike_diff.dist | tr ' ' '\n' | grep -v '[A-Z]' | sort -u -n -r

  • sed 1d spike_diff.dist: All versions start by deleting the first line (1d) of file spike_diff.dist with the stream editor sed and sending the remaining data down the data stream (pipeline).

  • tr ' ' '\n': In all versions, convert every blank into a return (newline) character. This turns the matrix into a single column of values that can easily be sorted.

  • In version 1, with the command fgrep -v . we take advantage of the fact that the matrix contains only integers and that sequence names always end with a period followed by a number. Therefore the pattern “.” finds only sequence names, and the qualifier -v inverts the match so that only the lines that do not contain the pattern are retained. The command fgrep is a special version of grep that searches for fixed strings rather than regular expressions; with plain grep the command should be written as grep -v '\.' instead. The “\.” notation “escapes” the dot with the help of a backslash \ so that it is read as an “actual dot” rather than “any character,” since grep uses “regular expressions” to encode patterns. The caveat with this method is that it would not remove any sequence name that does not contain a dot.

  • In version 2, with the command grep -v '[A-Z]' we use the “regular expression” pattern recognition of grep and remove all matching lines with the qualifier -v. Here the pattern [A-Z] represents any uppercase letter; the quotes keep the shell from interpreting the brackets as a filename wildcard. To make the case more general and include lowercase letters we could add one more piped command, grep -v '[a-z]', or combine both with yet another version of grep: egrep -v "[A-Z]|[a-z]".

  • sort -u -n -r is a sorting utility to which we add flags to keep only unique items with -u, sort in numerical rather than alphabetical order with -n, and reverse the order with -r so that the largest number is listed at the top.

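  • sort -n | tail -n1: The end of version 3 (given at the end of this section) sorts numerically in ascending order, so the largest number lands on the last line, which tail -n1 prints.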
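The difference between the filters described in the list above can be checked on a tiny sample. The following is only a sketch: the four input lines are invented for illustration, and grep -F is used as the equivalent of fgrep.

```shell
# Invented one-column sample: two numbers and two sequence-like names.
sample=$(printf '%s\n' '4' '12' 'MN908947.3' 'mw123456.1')

# A bare dot used as a regular expression matches any character, so -v
# discards every non-empty line. grep exits with status 1 when it selects
# nothing, hence the "|| true".
any_char=$(printf '%s\n' "$sample" | grep -v '.' || true)

# Escaped dot: only lines containing a literal dot are discarded.
escaped=$(printf '%s\n' "$sample" | grep -v '\.')

# Fixed-string search: the dot is taken literally without any escaping.
fixed=$(printf '%s\n' "$sample" | grep -F -v '.')

# Letter filter covering upper and lower case in one bracket expression.
letters=$(printf '%s\n' "$sample" | grep -v '[A-Za-z]')

printf 'escaped:\n%s\nfixed:\n%s\nletters:\n%s\n' "$escaped" "$fixed" "$letters"
```

On this sample the escaped, fixed-string and letter filters all keep the two numbers, while the unescaped regular-expression dot keeps nothing.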

The result of any of these commands is currently:

9
8
7
6
5
4
3
1
0

To retrieve only the top value we could add head -n 1 at the end of the pipeline.
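As a minimal sketch, here is version 2 with head -n 1 appended, run on a made-up 3x3 matrix. The temporary file, its values and its sequence names are assumptions chosen for illustration and only mimic the layout described for spike_diff.dist.

```shell
# Hypothetical matrix: sequence count on line 1, sequence name in the last column.
dist=$(mktemp)
printf '%s\n' '3' '0 2 12 MN908947.3' '2 0 7 MW123456.1' '12 7 0 OK091006.1' > "$dist"

# Version 2 of the pipeline, keeping only the first (largest) line.
largest=$(sed 1d "$dist" | tr ' ' '\n' | grep -v '[A-Z]' | sort -u -n -r | head -n 1)
echo "Largest value: $largest"

rm -f "$dist"
```

The sample deliberately contains a two-digit value: with -n the number 12 is ranked above 7, whereas a purely alphabetical sort would place 7 first.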

Even simpler: we do not in fact need to remove the sequence names if we sort numerically, because sort -n treats a line that does not begin with a number as zero. Version 3 sorts in ascending order and uses the tail command to print only the last line, which holds the largest number.

# version 3
sed 1d spike_diff.dist | tr ' ' '\n' | sort -n | tail -n1
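To see why version 3 works without filtering, it helps to look at the sorted column before tail is applied. This is a sketch on an invented matrix; the file contents and sequence names are assumptions, not data from spike_diff.dist.

```shell
# Hypothetical matrix in the layout described above.
dist=$(mktemp)
printf '%s\n' '3' '0 2 12 MN908947.3' '2 0 7 MW123456.1' '12 7 0 OK091006.1' > "$dist"

# Ascending numeric sort: the names count as 0 and are listed among the
# zeros at the top, well away from the largest value at the bottom.
sed 1d "$dist" | tr ' ' '\n' | sort -n

# Version 3: keep only the last line of that sorted column.
largest=$(sed 1d "$dist" | tr ' ' '\n' | sort -n | tail -n 1)
echo "Largest value: $largest"

rm -f "$dist"
```

One caveat to keep in mind: a sequence name that begins with digits would be ranked by that leading number, so this shortcut is only safe when the names start with a letter, as they do in this sample.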