3.3 Evaluating sequence length

When we downloaded the sequences from NCBI, many of them appeared to have a length of 1273 amino acids.

All complete sequences are probably almost the same length, but as an exercise we can evaluate the length of the first sequence in the filtered file using Unix utilities.

Utilities used:

  • head: shows the first 10 lines by default, or a designated number of lines.
  • sed: stream editor. Used here to remove the top line(s).
  • tr: translates or deletes characters. Used here with -d to delete the newline characters, which wc would otherwise count.
  • wc: word count. Reports the number of lines, words, and bytes. With the -m option it reports only the number of characters (newlines included).

head -17 spike_filtered.fa  | sed 1d | tr -d '\n' | wc -m
1273

Command Design Note:

head -17 takes the header line plus the 16 lines that hold the first sequence. The header line gives the name of the sequence and must be removed so that only the amino acid sequence lines remain. Newline characters are removed as well, because wc would otherwise count them.
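To see what each stage of the pipeline contributes, the same commands can be tried on a tiny made-up record. The file name toy.fa and its contents are hypothetical, chosen only so that the counts are easy to check by hand.

```shell
# Create a toy FASTA file (hypothetical data): one header line
# followed by two sequence lines of 5 and 3 residues.
printf '>toy_sequence\nMFVFL\nVLL\n' > toy.fa

# Counting everything: header, residues, and the three newlines.
head -3 toy.fa | wc -m

# Dropping the header and the newlines leaves only the residues.
head -3 toy.fa | sed 1d | tr -d '\n' | wc -m
```

The first command reports 24 (14 characters for the header line, 6 and 4 for the sequence lines, newlines included) and the second reports 8, the actual number of residues.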

Note: When the file spike_filtered.fa was prepared, the last step was sed 1d, which removed a leading empty line. Without that step, the command above would have to account for the extra line and be written as:

head -18 spike_filtered.fa | sed 1,2d | tr -d '\n' | wc -m
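The line counts passed to head (17 or 18) depend on how the file happens to be laid out. As an alternative sketch, not the method used in this section, awk can add up the sequence lines of the first record however many lines it spans, and it skips any empty line before the first header. The function name first_seq_length and the file demo.fa are hypothetical.

```shell
# Print the length of the first record in a FASTA file without
# knowing in advance how many lines the sequence spans.
# first_seq_length is a hypothetical helper name.
first_seq_length() {
    awk '
        # The first header starts the record; a second header ends it.
        /^>/ { if (seen) exit; seen = 1; next }
        # Add up the residues on each sequence line of the first record.
        seen { len += length($0) }
        # exit still runs END, so the total is printed either way.
        END  { print len + 0 }
    ' "$1"
}

# Hypothetical demo file: a leading empty line, then two records.
printf '\n>seq_a\nMFVFL\nVLL\n>seq_b\nAAAA\n' > demo.fa
first_seq_length demo.fa
```

For the demo file this prints 8. Run as first_seq_length spike_filtered.fa, it should agree with the head-based pipeline, assuming the file uses Unix line endings.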

In the next section we’ll align sequences.

Note: The original filtered sequence file of 32 sequences is available for download as spike_32.fa, which can be used instead of the spike_filtered.fa file just generated.