2.3 Checking sequences

There are two things we want to check:

1. Are there any “partial” sequences? The complete sequence is 1273 amino acids long.
2. Are there any **X** within the sequence? An X represents an unknown or undefined amino acid, caused by uncertainty in the nucleotide sequencing, so a sequence containing X is not 100% complete.

We can quickly accomplish both tasks with the grep command, which prints the lines that match a pattern. In our case the pattern is either the word partial or a capital X.

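Checking for “partial” with fgrep (fixed-string grep, often nicknamed “fast grep”, which treats the pattern as a literal string rather than a regular expression; it is equivalent to grep -F). We just need to provide the pattern followed by the file name. The -i option makes the search case-insensitive. Piping the result to head -3 prints only the first three sequence names.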

fgrep -i partial spike_raw.fa | head -3
>QJR94977.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
>QJR93825.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
>QJR92925.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
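
To see how many sequences are flagged rather than only the first three, grep's -c option prints the number of matching lines instead of the lines themselves. The following is a minimal sketch, not part of the original workflow: so that it runs on its own it builds a small made-up file, toy_spike.fa, whose names and sequences are hypothetical and not taken from spike_raw.fa. It uses grep -F, which behaves like fgrep.

```shell
# Build a tiny, made-up FASTA file (hypothetical names and sequences).
cat > toy_spike.fa <<'EOF'
>SEQ1 surface glycoprotein [toy]
MFVFLVLLPLVSSQ
>SEQ2 surface glycoprotein, partial [toy]
MFVFLVLLPLV
>SEQ3 surface glycoprotein, partial [toy]
MFVFLVLL
EOF

# Total number of sequences: every FASTA header line starts with ">".
total=$(grep -c '^>' toy_spike.fa)

# Number of lines containing the literal word "partial", ignoring case.
partial=$(grep -c -i -F partial toy_spike.fa)

echo "total=$total partial=$partial"
```

On the real data the same two grep commands can be pointed at spike_raw.fa to compare the number of partial entries with the total.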

Note: You can also use the more or less command to inspect the complete spike_raw.fa file one screenful at a time: press the space bar to scroll down one screen, and q to quit.

We’ll remove these sequences later for the final set. First we check for X, keeping only the last 3 lines of the output on the screen with tail:

fgrep -i X spike_raw.fa  | tail -3
LLALHRSYLTPGDSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSF
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPXKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL

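Since grep prints only the matching lines, the sequence names are not retained, so we do not know which sequences contain the Xs, but some certainly do. Note that with -i the pattern also matches a lowercase x, and that grep searches the header lines as well as the sequence lines, so a header containing the letter x would be reported too.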
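
One possible way to recover the names is to switch from grep to awk, which can remember the most recent header line while it scans the sequence lines. This is only a sketch under stated assumptions: it is a different technique from the grep commands above, and it runs on a made-up file, toy_x.fa, with hypothetical names and sequences so that the example is self-contained.

```shell
# Build a tiny, made-up multi-line FASTA file (hypothetical names and sequences).
cat > toy_x.fa <<'EOF'
>SEQ1 surface glycoprotein [toy]
MFVFLVLLPLVSSQ
CVNLTTRTQLPPAY
>SEQ2 surface glycoprotein [toy]
MFVFLVXXPLVSSQ
CVNLTTRTQLPPAY
>SEQ3 surface glycoprotein [toy]
MFVFLVLLPLVSSQ
CVNLTXRTQLPPAY
EOF

# For each record, store the header; if any of its sequence lines contains
# a capital X, print the stored header once.
with_x=$(awk '
  /^>/ { header = $0; reported = 0; next }          # new record: remember its name
  /X/ && !reported { print header; reported = 1 }   # first X seen in this record
' toy_x.fa)

echo "$with_x"
```

Because header lines are skipped with next before the X test, an X in a sequence name does not trigger a report; only sequence lines are examined.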

In the next section we’ll use “piped” command-lines to remove sequences with X and sequences that are “partial”.